Celerra LUN fails to snapshot during SRM Test

Posted: April 7, 2009 in Celerra, Site Recovery Manager, VMware

Today I was thinking about a problem I had this week with Site Recovery Manager and thought I’d post something to keep track of the errors, and in case someone else hits the same problem. Let me paint a picture.

Protected Site

Virtual Center Server with SRM installed

Celerra storage replication adapter

Three-node VMware HA/DRS cluster.

Celerra NS120 presenting 3 x iSCSI LUNs to the production ESX hosts.

Recovery Site

Virtual Center Server with SRM installed

Celerra storage replication adapter

Single ESX host.

Celerra NS20 presenting 3 x read-only iSCSI LUNs to the recovery site ESX host.

The Problem

As noted above, I have three LUNs replicating from the NS120 to the NS20, and they are all part of the same protection group configured in SRM. The largest LUN contains the virtual machine OS files; the second and third LUNs hold the SQL logs and TempDB for one of the protected VMs.

When I kicked off the SRM test, I noticed that only two of the three LUNs (the logs and TempDB LUNs) were being snapped and presented to the ESX host at the recovery site, so of course the test failed hideously when trying to start the VMs. Which raises an interesting question: shouldn’t the “Prepare Storage” recovery step warn that one of the LUNs configured in the “Array Manager” section of SRM failed to present to the ESX host, rather than failing with “Failed to recover datastore:” at the point of trying to start the virtual machine?

I went and grabbed the logs from the VC/SRM server and started to look through them to see what I could find.

The three replicated LUNs are part of shadow group ‘shadow-group-3685’:

          primaryUrl = "sanfs://vmfs_uuid:49b58a75-48b0e5f8-e7c8-00151777f2cc/",
          peerInfo = (dr.san.Lun.PeerInfo) [
             (dr.san.Lun.PeerInfo) {
                dynamicType = <unset>,
                arrayKey = "CK2000822009760000",
                lunKey = "fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000",

          primaryUrl = "sanfs://vmfs_uuid:49b82702-3e24f2c8-6e7f-00151777f2cc/",
          peerInfo = (dr.san.Lun.PeerInfo) [
             (dr.san.Lun.PeerInfo) {
                dynamicType = <unset>,
                arrayKey = "CK2000822009760000",
                lunKey = "fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000",

          primaryUrl = "sanfs://vmfs_uuid:49b82718-7a028960-9092-00151777f2cc/",
          peerInfo = (dr.san.Lun.PeerInfo) [
             (dr.san.Lun.PeerInfo) {
                dynamicType = <unset>,
                arrayKey = "CK2000822009760000",
                lunKey = "fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000",
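As an aside, you can pull that datastore-to-LUN mapping straight out of the log rather than scrolling for it. Here’s a minimal sketch in Python, assuming you’ve copied the SRM log locally as vmware-dr.log (a placeholder name for whichever log file you grabbed):

import re

# Map each VMFS primaryUrl to its replica lunKey in a saved copy of the
# SRM log. "vmware-dr.log" is a placeholder filename, not the actual log
# name on the VC/SRM server.
url_re = re.compile(r'primaryUrl = "(sanfs://[^"]+)"')
key_re = re.compile(r'lunKey = "([^"]+)"')

pending_url = None
with open("vmware-dr.log") as log:
    for line in log:
        url_match = url_re.search(line)
        if url_match:
            pending_url = url_match.group(1)
            continue
        key_match = key_re.search(line)
        if key_match and pending_url:
            print(pending_url, "->", key_match.group(1))
            pending_url = None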

 

 

Next, the SRA creates LUN snapshots for all three replica LUN keys:

 

[2009-03-30 12:36:49.031 'SecondarySanProvider' 868 verbose] Creating lun snapshots for group 'SRM Protected Systems'
[#1]   <ReplicaLunKeyList>
[#1]     <ReplicaLunKey>fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000</ReplicaLunKey>
[#1]     <ReplicaLunKey>fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000</ReplicaLunKey>
[#1]     <ReplicaLunKey>fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000</ReplicaLunKey>
[#1]   </ReplicaLunKeyList>

 

 

Here we see the test failover present only two of the three LUNs (and note that testFailover still exits with code 0):

[2009-03-30 12:37:13.500 'SecondarySanProvider' 868 info] testFailover exited with exit code 0
[2009-03-30 12:37:13.500 'SecondarySanProvider' 868 trivia] 'testFailover' returned <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
[#1] <Response>
[#1]     <ReturnCode>0</ReturnCode>
[#1]     <InitiatorGroupList>
[#1]         <InitiatorGroup id="0">
[#1]             <Initiator type="ISCSI" id="iqn.1998-01.com.vmware:esxdr-5a1f63a8"/>
[#1]         </InitiatorGroup>
[#1]         <InitiatorGroup id="iScsi-fc-all">
[#1]             <Initiator type="iscsi" id="iqn.1998-01.com.vmware:esxdr-5a1f63a8"/>
[#1]         </InitiatorGroup>
[#1]     </InitiatorGroupList>
[#1]     <ReplicaLunList>
[#1]         <ReplicaLun key="fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000">
[#1]             <Number initiatorGroupId="iScsi-fc-all">128</Number>
[#1]         </ReplicaLun>
[#1]         <ReplicaLun key="fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000">
[#1]             <Number initiatorGroupId="iScsi-fc-all">129</Number>
[#1]         </ReplicaLun>
[#1]     </ReplicaLunList>
[#1] </Response>
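Rather than eyeballing the XML, you can diff the requested keys against what testFailover actually returned. This is just a quick sketch, assuming the <ReplicaLunKeyList> and <Response> fragments above have been saved as request.xml and response.xml (hypothetical filenames) with the “[#1]” log prefixes stripped:

import xml.etree.ElementTree as ET

# Compare the ReplicaLunKey list the SRA was asked to snapshot with the
# ReplicaLun keys it actually returned from testFailover. "request.xml" and
# "response.xml" are hypothetical files holding the two XML fragments above,
# minus the "[#1]" log prefixes.
requested = {e.text for e in ET.parse("request.xml").getroot().iter("ReplicaLunKey")}
returned = {e.get("key") for e in ET.parse("response.xml").getroot().iter("ReplicaLun")}

for key in sorted(requested - returned):
    print("Missing from testFailover response:", key)

Run against the fragments above, this prints the fs43_T1_LUN1 key, which is exactly the LUN named in the fault below.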

 

Here we see SRM log an error about the failure: a dr.san.fault.RecoveredDatastoreNotFound whose reason is dr.san.fault.LunFailoverFailed for the missing LUN:

 

[2009-03-30 12:39:59.484 'SecondarySanProvider' 868 warning] Failed to prepare shadow vm for recovery: Unexpected MethodFault (dr.san.fault.RecoveredDatastoreNotFound) {
[#1]    dynamicType = <unset>,
[#1]    datastore = (dr.vimext.SanProviderDatastoreLocator) {
[#1]       dynamicType = <unset>,
[#1]       primaryUrl = "sanfs://vmfs_uuid:49b58a75-48b0e5f8-e7c8-00151777f2cc/",
[#1]    },
[#1]    reason = (dr.san.fault.LunFailoverFailed) {
[#1]       dynamicType = <unset>,
[#1]       key = "fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000",

 

 

This problem was actually caused by the initial replication task never completing successfully. After a ton of troubleshooting, the EMC Celerra support team suspected memory corruption on the Data Mover, and once it was rebooted the replication task completed its initial “FULL COPY”. Subsequent SRM tests then completed successfully, with all three LUNs being presented to the DR ESX host.
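If you hit something similar, it’s worth confirming on the Celerra side that every replication session has actually finished its initial full copy before re-running the test. That state comes from nas_replicate on the Control Station; below is a rough sketch that simply shells out to it and dumps the detail for each session (run it on the Control Station, or just run the two commands by hand over SSH). Treat the parsing of the -list output as an assumption, since the exact format varies by DART/Replicator version.

import subprocess

# Rough sanity check for Celerra Replicator V2 sessions. Lists the sessions
# and prints the detailed info for each so you can confirm the initial full
# copy completed. Assumes it runs on the Control Station as nasadmin and
# that session names contain no spaces; adjust for your DART version.
sessions = subprocess.run(
    ["nas_replicate", "-list"], capture_output=True, text=True, check=True
)
print(sessions.stdout)

for line in sessions.stdout.splitlines()[1:]:  # skip the header row
    fields = line.split()
    if not fields:
        continue
    info = subprocess.run(
        ["nas_replicate", "-info", fields[0]], capture_output=True, text=True
    )
    print(info.stdout)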

 

As noted above, I think the “Prepare Storage” recovery step should have warned that one of the LUNs failed to snapshot, rather than failing while trying to power on the VMs.

 

If you have any thoughts on this, let me know.

Comments

  1. Ryan says:

    Just wondering, did you check the logs created by the Celerra adapter? Did they indicate that anything wrong had happened (or I guess not happened)?

    • Brian Norris says:

      Hi Ryan.

      Yep sure did.

      The following data was noted from server_log output.

      2009-01-29 13:57:35: 13160349696: VCS: 3: ufs failed to create snap fs28_T1_LUN0_CK200083100100_0000.ckpt000_5567565935252104: NoSpace

  2. Sky says:

    At a customer’s site, EMC services came out to replicate a Celerra before shipping it to a recovery site. I was brought in after that to configure SRM, and it appears that the replication targets are not set to read-only. I looked at the mounts, and there is a place to click read-only, but it says that hosts are connected.

    Is the read-only setting a requirement, or just to prevent accidentally creating a datastore on that LUN? If it is a requirement, would it be easier to delete the target LUNs and recreate the replication, or is there another way?
