Troubleshooting Unexpected Deletion of SRM Recovery Plans.

 

Problem Description

After the VMware SRM services were restarted on both the Protected and Recovery sites, almost all of the recovery plans were deleted from the SRM console.

 

Details about the issue:

We were running VMware SRM 8.1 (the startup log entry further down reports 8.1.2 build-12686166).

 

We were configuring a test run in the SRM console when, for reasons we could not determine, the SRM UI hung. To recover, I restarted the SRM service on both the Protected and DR sites. After the restart, all but a few of my recovery plans were missing from the SRM UI.

 

To find the cause, we analyzed the SRM logs. The relevant entries are shown below:

 

Log Location: C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs

 

Recovery site restarted.

 

 2020-05-30T13:17:23.439+10:00 info -[16956] [SRM@6800 sub=Default] VMware vCenter Site Recovery Manager 8.1.2 build-12686166 is starting

 

Protected Site marked connection as Disconnected

 

2020-05-30T13:18:56.670+10:00 verbose vmware-dr[15988] [SRM@6800 sub=RemoteSite connID=dr-admin-56ac ctxID=1dae7405] Updating DR connection status to 'disconnected'

 

 Protected Site marked connection as Connected

 

2020-05-30T13:19:58.883+10:00 verbose vmware-dr[18066] [SRM@6800 sub=RemoteSite ctxID=1dae7405] Updating DR connection status to 'connected'

 

Then the plans were removed.

 

 2020-05-30T13:19:59.345+10:00 verbose vmware-dr[13354] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5 opID=733bc6:709f:ee8d:3bb3] HandleRemotePlanDeletion: dr.recovery.RecoveryPlan:9fc76f5-846-4237-a386-7c89a3cf3915

 

2020-05-30T13:19:59.345+10:00 verbose vmware-dr[15292] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5 opID=7339c6:79f:ee8d:735] HandleRemotePlanDeletion: dr.recovery.RecoveryPlan:9e997d70-807-4837-99c9-eecefeddc0a3

 

2020-05-30T13:19:59.361+10:00 verbose vmware-dr[11672] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5 opID=7369c6:70f:ee8d:45c] HandleRemotePlanDeletion: dr.recovery.RecoveryPlan:eaf9bff-b0cf-41d-851c-6e72e27432

 

2020-05-30T13:19:59.788+10:00 verbose vmware-dr[10724] [SRM@6800 sub=Recovery ctxID=50f9cfb5 opID=7339c6:709f:ee8d:e713] RecoveryPlanOperationComplete: operation queue for plan a6ec73a1-43f6-424-bc58-8aeb505e710a is empty, removing.

 

2020-05-30T13:19:59.788+10:00 verbose vmware-dr[18036] [SRM@6800 sub=Recovery ctxID=50f9cfb5 opID=7339c6:70f:ee8d:3307] RecoveryPlanOperationComplete: operation queue for plan 6cde15e9-8351-410f-b58f-a6b9da56c01d is empty, removing.

 

2020-05-30T13:19:59.788+10:00 verbose vmware-dr[17840] [SRM@6800 sub=Recovery ctxID=50f9cfb5 opID=733bc6:79f:ee8d:7f58] RecoveryPlanOperationComplete: operation queue for plan 28e8cd72-996e-4fa4-9b4f-1b1c2cb66279 is empty, removing.

 

 

2020-05-28T13:19:59.772+10:00 verbose vmware-dr[15648] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5 opID=733b66:709f:ee8d:73e5] RemovePlanInternal[9e997d70-8027-4837-99c9-eecefeddc0a3]: Acquired DB context

 

2020-05-28T13:19:59.772+10:00 verbose vmware-dr[13384] [SRM@6800 sub=PropertyProvider ctxID=2f6ceef7 opID=733b66:709f:ee8d:73e5] RecordOp REMOVE: childEntity["9e997d70-8027-4837-99c9-eecefeddc0a3"], DrRecoveryRootFolder. Applied change to temp map.

 

2020-05-28T13:19:59.772+10:00 info vmware-dr[15648] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5 opID=733bc6:709f:ee8d:73e5] RemovePlanInternal[9e997d70-8027-4837-99c9-eecefeddc0a3]: Plan destroyed

 

               <<Note: Dates and IDs in the log entries have been changed to protect the original data>>
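When there are many plans, picking these entries out of the log by hand gets tedious. Below is a minimal Python sketch that pulls the plan IDs out of the HandleRemotePlanDeletion and "Plan destroyed" entries. The patterns are built only from the sample lines in this post, so treat them as an assumption: other SRM builds may word these messages differently. The function name find_plan_removals is my own and not part of any SRM tooling.

```python
import re

# Patterns derived only from the sample entries in this post; other SRM
# versions may word these messages differently (assumption).
TIMESTAMP_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2})"
)
PLAN_EVENTS = {
    "remote_deletion": re.compile(
        r"HandleRemotePlanDeletion: dr\.recovery\.RecoveryPlan:([0-9a-fA-F-]+)"
    ),
    "destroyed": re.compile(
        r"RemovePlanInternal\[([0-9a-fA-F-]+)\]: Plan destroyed"
    ),
}


def find_plan_removals(lines):
    """Return (timestamp, event_kind, plan_id) for each matching log line."""
    events = []
    for line in lines:
        ts_match = TIMESTAMP_RE.match(line)
        timestamp = ts_match.group(1) if ts_match else None
        for kind, pattern in PLAN_EVENTS.items():
            found = pattern.search(line)
            if found:
                events.append((timestamp, kind, found.group(1)))
    return events


# Two entries taken from the excerpt above (context fields shortened).
sample = [
    "2020-05-30T13:19:59.345+10:00 verbose vmware-dr[15292] "
    "[SRM@6800 sub=RecoveryVMODL] HandleRemotePlanDeletion: "
    "dr.recovery.RecoveryPlan:9e997d70-807-4837-99c9-eecefeddc0a3",
    "2020-05-28T13:19:59.772+10:00 info vmware-dr[15648] "
    "[SRM@6800 sub=RecoveryVMODL] "
    "RemovePlanInternal[9e997d70-8027-4837-99c9-eecefeddc0a3]: Plan destroyed",
]
for event in find_plan_removals(sample):
    print(event)
```

To scan a real log, pass an open file object instead of the sample list, for example find_plan_removals(open(path, errors="replace")), where path points to a log file under the Log Location mentioned earlier.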


 Workarounds available to fix the issue:

 

1). Re-create all the recovery plans manually, which is a time-consuming process.

 

2). Restore the SRM server from an image-level backup.

 

Luckily, we had image-level backups configured for our SRM server, so we were able to restore it to its previous state, which saved a lot of manual effort.

 

 

Detailed Root Cause Analysis:

 

1). Checked the logs, as highlighted above in the issue details.

 

2). We first suspected SRM SQL database corruption, but an SRM DB health check showed that the database was fine and still contained the full count of recovery plans we had.

 

3). Finally, we went through the VMware SRM release notes (link below) and found that this is a known bug in earlier SRM versions, which was fixed in SRM release 8.3.0.1.

 

 

As per the release notes, when the SRM services/servers are restarted simultaneously on both sites (Protected/DR), and when multiple sequenced restarts occur on a single site within less than a minute, some of the Site Recovery Manager plug-ins might time out during shutdown. As a result, the remote site might delete the recovery plans and history reports.
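To check whether your own restarts fell inside that window, you can compare the "is starting" entries from the logs of both sites. The following is only a rough Python sketch: the line format is taken from the single startup entry shown in this post, the one-minute window comes from the release-note wording quoted above, and the helper names service_starts and close_restarts are hypothetical. The second sample line is made up for illustration and does not come from our logs.

```python
import re
from datetime import datetime, timedelta

# Matches the startup entry format seen in this post (assumption: other
# SRM builds log startup the same way).
START_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2})"
    r".*Site Recovery Manager .* is starting"
)


def service_starts(lines):
    """Return the timestamp of every 'is starting' entry as a datetime."""
    starts = []
    for line in lines:
        found = START_RE.match(line)
        if found:
            starts.append(
                datetime.strptime(found.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
            )
    return starts


def close_restarts(starts, window=timedelta(minutes=1)):
    """Return consecutive pairs of start times that are closer than window."""
    ordered = sorted(starts)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a < window]


lines = [
    # Entry from the excerpt in this post.
    "2020-05-30T13:17:23.439+10:00 info -[16956] [SRM@6800 sub=Default] "
    "VMware vCenter Site Recovery Manager 8.1.2 build-12686166 is starting",
    # Made-up entry for illustration only (for example, from the other site).
    "2020-05-30T13:17:50.100+10:00 info -[17000] [SRM@6800 sub=Default] "
    "VMware vCenter Site Recovery Manager 8.1.2 build-12686166 is starting",
]
for first, second in close_restarts(service_starts(lines)):
    print("Restarts too close together:", first, second)
```

Feed it the combined lines from the Protected and DR site logs. Any pair it reports means two service starts happened less than a minute apart, which is the situation the release notes warn about.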

 

This is exactly what we experienced.


 

Release notes Link:

 

https://docs.vmware.com/en/Site-Recovery-Manager/8.3/rn/srm-releasenotes-8-3.html#resolvedissues

 

Lessons Learnt (this section is important in every troubleshooting exercise, and I would like to highlight it):

 

·       Always make sure logging is enabled, so that logs are available in time to troubleshoot the issue and understand the problem.

·       It is always recommended to back up your production servers.

·       Always go through vendor release notes and technical guides to stay aware of known issues and bugs.


                                                                            -xxx-

