Today I was thinking about a problem I had this week with Site Recovery Manager, and I thought I'd post something to keep track of the errors in case someone else hits the same problem. Let me paint a picture.
Protected Site
Virtual Center Server with SRM installed
Celerra storage replication adapter
Three node VMware HA/DRS cluster.
Celerra NS120 presenting 3 x iSCSI LUNs to the production ESX hosts.
Recovery Site
Virtual Center Server with SRM installed
Celerra storage replication adapter
Single ESX host.
Celerra NS20 presenting 3 x read-only iSCSI LUNs to the recovery site ESX host.
The Problem
As noted above, I have three LUNs replicating from the NS120 to the NS20, all part of the same protection group configured in SRM. The largest LUN contains the virtual machine OS files; the second and third LUNs hold the SQL logs and TempDB for one of the protected VMs.
When I kicked off the SRM test I noticed that only two of the three LUNs (logs and TempDB) were being snapped and presented to the ESX host at the recovery site, so of course the test failed hideously when trying to start the VMs. Which raises an interesting question: shouldn't the "Prepare Storage" recovery step warn that one of the expected LUNs configured in the "Array Manager" section of SRM failed to present to the ESX host, rather than failing with "Failed to recover datastore:" at the point of trying to start the virtual machine?
I went and grabbed the logs from the VC/SRM server and started to look through them to see what I could find.
The three replicated LUNs are part of shadow group 'shadow-group-3685':
primaryUrl = "sanfs://vmfs_uuid:49b58a75-48b0e5f8-e7c8-00151777f2cc/",
peerInfo = (dr.san.Lun.PeerInfo) [
(dr.san.Lun.PeerInfo) {
dynamicType = <unset>,
arrayKey = "CK2000822009760000",
lunKey = "fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000",
primaryUrl = "sanfs://vmfs_uuid:49b82702-3e24f2c8-6e7f-00151777f2cc/",
peerInfo = (dr.san.Lun.PeerInfo) [
(dr.san.Lun.PeerInfo) {
dynamicType = <unset>,
arrayKey = "CK2000822009760000",
lunKey = "fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000",
primaryUrl = "sanfs://vmfs_uuid:49b82718-7a028960-9092-00151777f2cc/",
peerInfo = (dr.san.Lun.PeerInfo) [
(dr.san.Lun.PeerInfo) {
dynamicType = <unset>,
arrayKey = "CK2000822009760000",
lunKey = "fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000",
The SRA creates the LUN snapshots:
[2009-03-30 12:36:49.031 'SecondarySanProvider' 868 verbose] Creating lun snapshots for group 'SRM Protected Systems'
[#1] <ReplicaLunKeyList>
[#1] <ReplicaLunKey>fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000</ReplicaLunKey>
[#1] <ReplicaLunKey>fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000</ReplicaLunKey>
[#1] <ReplicaLunKey>fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000</ReplicaLunKey>
[#1] </ReplicaLunKeyList>
Here we see that the test failover presented only two of the three LUNs:
[2009-03-30 12:37:13.500 'SecondarySanProvider' 868 info] testFailover exited with exit code 0
[2009-03-30 12:37:13.500 'SecondarySanProvider' 868 trivia] 'testFailover' returned <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
[#1] <Response>
[#1] <ReturnCode>0</ReturnCode>
[#1] <InitiatorGroupList>
[#1] <InitiatorGroup id="0">
[#1] <Initiator type="ISCSI" id="iqn.1998-01.com.vmware:esxdr-5a1f63a8"/>
[#1] </InitiatorGroup>
[#1] <InitiatorGroup id="iScsi-fc-all">
[#1] <Initiator type="iscsi" id="iqn.1998-01.com.vmware:esxdr-5a1f63a8"/>
[#1] </InitiatorGroup>
[#1] </InitiatorGroupList>
[#1] <ReplicaLunList>
[#1] <ReplicaLun key="fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000">
[#1] <Number initiatorGroupId="iScsi-fc-all">128</Number>
[#1] </ReplicaLun>
[#1] <ReplicaLun key="fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000">
[#1] <Number initiatorGroupId="iScsi-fc-all">129</Number>
[#1] </ReplicaLun>
[#1] </ReplicaLunList>
[#1] </Response>
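This is exactly the kind of mismatch you can spot by diffing the request against the response. The sketch below, a rough sanity check rather than anything SRM itself does, compares the LUN keys from the "Creating lun snapshots" request against the keys that actually came back in the testFailover Response (the XML string is a trimmed stand-in for the real SRA output):

```python
# Sketch: find LUN keys that were requested for snapshot but never
# presented back in the testFailover response. The key values are the
# real ones from the logs above; the XML is a trimmed stand-in.
import xml.etree.ElementTree as ET

# Keys from the 'Creating lun snapshots' request (all three LUNs).
expected = {
    "fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000",
    "fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000",
    "fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000",
}

response_xml = """<Response>
  <ReturnCode>0</ReturnCode>
  <ReplicaLunList>
    <ReplicaLun key="fs45_T1_LUN2_CKM00085000953_0000_fs42_T2_LUN2_CK200082200976_0000"/>
    <ReplicaLun key="fs47_T1_LUN3_CKM00085000953_0000_fs46_T2_LUN3_CK200082200976_0000"/>
  </ReplicaLunList>
</Response>"""

# Collect every ReplicaLun key the SRA reported as presented.
presented = {
    lun.get("key")
    for lun in ET.fromstring(response_xml).iter("ReplicaLun")
}

missing = expected - presented
for key in sorted(missing):
    print("LUN never presented to the recovery host:", key)
```

Run against these logs, the only missing key is the fs43/fs41 one, i.e. the OS-files LUN, which matches where the test fell over.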
Here we see SRM log an error about the failure (dr.san.fault.LunFailoverFailed)
[2009-03-30 12:39:59.484 'SecondarySanProvider' 868 warning] Failed to prepare shadow vm for recovery: Unexpected MethodFault (dr.san.fault.RecoveredDatastoreNotFound) {
[#1] dynamicType = <unset>,
[#1] datastore = (dr.vimext.SanProviderDatastoreLocator) {
[#1] dynamicType = <unset>,
[#1] primaryUrl = "sanfs://vmfs_uuid:49b58a75-48b0e5f8-e7c8-00151777f2cc/",
[#1] },
[#1] reason = (dr.san.fault.LunFailoverFailed) {
[#1] dynamicType = <unset>,
[#1] key = "fs43_T1_LUN1_CKM00085000953_0000_fs41_T2_LUN1_CK200082200976_0000",
This problem was actually caused by the initial replication task not completing successfully. After a ton of troubleshooting, the EMC Celerra support team suspected memory corruption on the Data Mover; once it was rebooted, the replication task completed its initial "FULL COPY" and subsequent SRM tests completed successfully, with all three LUNs presented to the DR ESX host.
As noted above, I think the "Prepare Storage" recovery step should have warned that one of the LUNs failed to snapshot rather than failing while trying to power on the VMs.
If you have any thoughts on this, let me know.
Just wondering, did you check the logs created by the Celerra adapter? Did they indicate that anything wrong had happened (or I guess not happened)?
Hi Ryan.
Yep, sure did.
The following was noted in the server_log output:
2009-01-29 13:57:35: 13160349696: VCS: 3: ufs failed to create snap fs28_T1_LUN0_CK200083100100_0000.ckpt000_5567565935252104: NoSpace
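Lines like that are easy to miss in a busy server_log, so a quick script to pull them out can help. This is just a sketch using plain substring parsing (the function name and parsing approach are my own, not anything from the Celerra tooling), fed with the real log line above:

```python
# Sketch: pull failed snap creations (like the NoSpace line above) out
# of Celerra server_log output. Plain substring parsing; the sample
# line is the real one from the logs.
sample = (
    "2009-01-29 13:57:35: 13160349696: VCS: 3: ufs failed to create snap "
    "fs28_T1_LUN0_CK200083100100_0000.ckpt000_5567565935252104: NoSpace"
)

def snap_failures(lines):
    """Return (snap_name, reason) for each failed snap creation line."""
    failures = []
    for line in lines:
        if "failed to create snap " not in line:
            continue
        # Everything after the marker reads "<snap_name>: <reason>".
        tail = line.split("failed to create snap ", 1)[1]
        snap_name, _, reason = tail.rpartition(": ")
        failures.append((snap_name, reason))
    return failures

print(snap_failures([sample]))
```

For the line above this reports the checkpoint name with the reason NoSpace, which points straight at the SavVol/checkpoint space problem rather than anything on the SRM side.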
At a customer's site, EMC services came out to replicate a Celerra before shipping it to a recovery site. I was brought in after that to configure SRM, and it appears the replication targets are not set to read-only. I looked at the mounts, and there is a place to click read-only, but it says that hosts are connected.
Is the read-only setting a requirement, or is it just there to prevent accidentally creating a datastore on that LUN? If it is a requirement, would it be easier to delete the target LUNs and recreate the replication, or is there another way?