Archive for April 7, 2009

Today I was thinking about a problem I had this week with Site Recovery Manager (SRM) and thought I'd post something, just to keep track of the errors and in case someone else hits the same problem. Let me paint a picture.

Protected Site

Virtual Center Server with SRM installed

Celerra storage replication adapter

Three-node VMware HA/DRS cluster.

Celerra NS120 presenting 3 x iSCSI LUNs to the production ESX hosts.

Recovery Site

Virtual Center Server with SRM installed

Celerra storage replication adapter

Single ESX host.

Celerra NS20 presenting 3 x read-only iSCSI LUNs to the recovery site ESX host.

The Problem

As noted above, I have three LUNs replicating from the NS120 to the NS20, all part of the same protection group configured in SRM. The largest LUN contains the virtual machine OS files; the second and third LUNs hold the SQL logs and TempDB for one of the protected VMs.

When I kicked off the SRM test I noticed that only two of the three LUNs (logs and TempDB) were being snapped and presented to the ESX host at the recovery site, so of course the test failed hideously when trying to start the VMs. Which raises an interesting question: shouldn't the "Prepare Storage" recovery step warn that one of the expected LUNs configured in the "Array Manager" section of SRM failed to present to the ESX host, rather than failing with "Failed to recover datastore:" at the point of trying to start the virtual machine?

I went and grabbed the logs from the VC/SRM server and started to look through them to see what I could find.

The three replicated LUNs are part of shadow group 'shadow-group-3685':

          primaryUrl = "sanfs://vmfs_uuid:49b58a75-48b0e5f8-e7c8-00151777f2cc/",

         peerInfo = (dr.san.Lun.PeerInfo) [

            (dr.san.Lun.PeerInfo) {

               dynamicType = <unset>,

               arrayKey = "CK2000822009760000",

               lunKey = "fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000",

 

         primaryUrl = "sanfs://vmfs_uuid:49b82702-3e24f2c8-6e7f-00151777f2cc/",

         peerInfo = (dr.san.Lun.PeerInfo) [

            (dr.san.Lun.PeerInfo) {

               dynamicType = <unset>,

               arrayKey = "CK2000822009760000",

               lunKey = "fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000",

 

         primaryUrl = "sanfs://vmfs_uuid:49b82718-7a028960-9092-00151777f2cc/",

         peerInfo = (dr.san.Lun.PeerInfo) [

            (dr.san.Lun.PeerInfo) {

               dynamicType = <unset>,

               arrayKey = "CK2000822009760000",

               lunKey = "fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000",

 

 

The SRA creates the LUN snapshots:

 

[2009-03-30 12:36:49.031 'SecondarySanProvider' 868 verbose] Creating lun snapshots for group 'SRM Protected Systems'

[#1]   <ReplicaLunKeyList>

[#1]     <ReplicaLunKey>fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000</ReplicaLunKey>

[#1]     <ReplicaLunKey>fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000</ReplicaLunKey>

[#1]     <ReplicaLunKey>fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000</ReplicaLunKey>

[#1]   </ReplicaLunKeyList>

 

 

Here we see the test failover only presents two of the three LUNs:

[2009-03-30 12:37:13.500 'SecondarySanProvider' 868 info] testFailover exited with exit code 0

[2009-03-30 12:37:13.500 'SecondarySanProvider' 868 trivia] 'testFailover' returned <?xml version="1.0" encoding="UTF-8" standalone="yes"?>

[#1] <Response>

[#1]     <ReturnCode>0</ReturnCode>

[#1]     <InitiatorGroupList>

[#1]         <InitiatorGroup id="0">

[#1]             <Initiator type="ISCSI" id="iqn.1998-01.com.vmware:esxdr-5a1f63a8"/>

[#1]         </InitiatorGroup>

[#1]         <InitiatorGroup id="iScsi-fc-all">

[#1]             <Initiator type="iscsi" id="iqn.1998-01.com.vmware:esxdr-5a1f63a8"/>

[#1]         </InitiatorGroup>

[#1]     </InitiatorGroupList>

[#1]     <ReplicaLunList>

[#1]         <ReplicaLun key="fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000">

[#1]             <Number initiatorGroupId="iScsi-fc-all">128</Number>

[#1]         </ReplicaLun>

[#1]         <ReplicaLun key="fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000">

[#1]             <Number initiatorGroupId="iScsi-fc-all">129</Number>

[#1]         </ReplicaLun>

[#1]     </ReplicaLunList>

[#1] </Response>
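To spot the mismatch without eyeballing the XML, a quick sanity check is to diff the ReplicaLunKey list SRM requested against the keys that came back in the testFailover response. A minimal sketch (the log excerpts are embedded as strings here for illustration; in practice you'd pull them from the SRM log file):

```python
import re

# The three ReplicaLunKey values SRM asked the SRA to snapshot
# (taken from the 'Creating lun snapshots' log entry above).
requested = [
    "fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000",
    "fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000",
    "fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000",
]

# Abridged copy of the <Response> the SRA returned from testFailover.
response = '''
<Response>
    <ReturnCode>0</ReturnCode>
    <ReplicaLunList>
        <ReplicaLun key="fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000">
            <Number initiatorGroupId="iScsi-fc-all">128</Number>
        </ReplicaLun>
        <ReplicaLun key="fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000">
            <Number initiatorGroupId="iScsi-fc-all">129</Number>
        </ReplicaLun>
    </ReplicaLunList>
</Response>
'''

# Keys the SRA actually presented to the recovery host.
presented = set(re.findall(r'<ReplicaLun key="([^"]+)"', response))

# Anything requested but not presented is the LUN that will blow up later.
missing = [key for key in requested if key not in presented]
for key in missing:
    print("LUN snapshot missing from testFailover response:", key)
```

Running this against the excerpts above flags the fs43/fs41 key (the OS LUN) as the one that never made it to the recovery host, which matches the fault SRM logs further down.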

 

Here we see SRM log an error about the failure (dr.san.fault.LunFailoverFailed):

 

[2009-03-30 12:39:59.484 'SecondarySanProvider' 868 warning] Failed to prepare shadow vm for recovery: Unexpected MethodFault (dr.san.fault.RecoveredDatastoreNotFound) {

[#1]    dynamicType = <unset>,

[#1]    datastore = (dr.vimext.SanProviderDatastoreLocator) {

[#1]       dynamicType = <unset>,

[#1]       primaryUrl = "sanfs://vmfs_uuid:49b58a75-48b0e5f8-e7c8-00151777f2cc/",

[#1]    },

[#1]    reason = (dr.san.fault.LunFailoverFailed) {

[#1]       dynamicType = <unset>,

[#1]       key = "fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000",

 

 

This problem was actually caused by the initial replication task not completing successfully. After a ton of troubleshooting, the EMC Celerra support team suspected memory corruption on the Data Mover; once it was rebooted, the replication task completed its initial "FULL COPY", and subsequent SRM tests completed successfully with all three LUNs being presented to the DR ESX host.

 

As noted above, I think the "Prepare Storage" recovery step should have warned that one of the LUNs failed to snapshot, rather than failing while trying to power on the VMs.

 

If you have any thoughts on this, let me know.