Troubleshooting Unexpected Deletion of SRM Recovery Plans.
Problem Description:
After restarting the VMware SRM services on both the Protected and Recovery sites, almost all of the recovery plans were deleted from the SRM console.
Details about the issue:
The VMware SRM version we were using was 8.1.
We were configuring a test run in the SRM console when, for some unforeseen reason, the SRM UI hung. To fix this, I restarted the SRM service on both the Protected and DR sites. After the restart, I found that all but a few of my recovery plans were missing from the SRM UI.
To find the cause, we started analyzing the SRM logs. Below are the entries we found:
Log Location: C:\ProgramData\
Recovery site restarted.
2020-05-30T13:17:23.439+10:00
info -[16956] [SRM@6800 sub=Default] VMware vCenter Site Recovery Manager 8.1.2
build-12686166 is starting
Protected Site marked connection as Disconnected
2020-05-30T13:18:56.670+10:00
verbose vmware-dr[15988] [SRM@6800 sub=RemoteSite connID=dr-admin-56ac ctxID=1dae7405]
Updating DR connection status to 'disconnected'
Protected Site marked connection as Connected
2020-05-30T13:19:58.883+10:00
verbose vmware-dr[18066] [SRM@6800 sub=RemoteSite ctxID=1dae7405] Updating DR
connection status to 'connected'
Then the plan was removed.
2020-05-30T13:19:59.345+10:00
verbose vmware-dr[13354] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5
opID=733bc6:709f:ee8d:3bb3] HandleRemotePlanDeletion:
dr.recovery.RecoveryPlan:9fc76f5-846-4237-a386-7c89a3cf3915
2020-05-30T13:19:59.345+10:00
verbose vmware-dr[15292] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5
opID=7339c6:79f:ee8d:735] HandleRemotePlanDeletion: dr.recovery.RecoveryPlan:9e997d70-807-4837-99c9-eecefeddc0a3
2020-05-30T13:19:59.361+10:00
verbose vmware-dr[11672] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5
opID=7369c6:70f:ee8d:45c] HandleRemotePlanDeletion: dr.recovery.RecoveryPlan:eaf9bff-b0cf-41d-851c-6e72e27432
2020-05-30T13:19:59.788+10:00
verbose vmware-dr[10724] [SRM@6800 sub=Recovery ctxID=50f9cfb5
opID=7339c6:709f:ee8d:e713] RecoveryPlanOperationComplete: operation queue for
plan a6ec73a1-43f6-424-bc58-8aeb505e710a is empty, removing.
2020-05-30T13:19:59.788+10:00
verbose vmware-dr[18036] [SRM@6800 sub=Recovery ctxID=50f9cfb5 opID=7339c6:70f:ee8d:3307]
RecoveryPlanOperationComplete: operation queue for plan
6cde15e9-8351-410f-b58f-a6b9da56c01d is empty, removing.
2020-05-30T13:19:59.788+10:00
verbose vmware-dr[17840] [SRM@6800 sub=Recovery ctxID=50f9cfb5 opID=733bc6:79f:ee8d:7f58]
RecoveryPlanOperationComplete: operation queue for plan
28e8cd72-996e-4fa4-9b4f-1b1c2cb66279 is empty, removing.
2020-05-28T13:19:59.772+10:00
verbose vmware-dr[15648] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5
opID=733b66:709f:ee8d:73e5] RemovePlanInternal[9e997d70-8027-4837-99c9-eecefeddc0a3]: Acquired DB context
2020-05-28T13:19:59.772+10:00
verbose vmware-dr[13384] [SRM@6800 sub=PropertyProvider ctxID=2f6ceef7
opID=733b66:709f:ee8d:73e5] RecordOp REMOVE:
childEntity["9e997d70-8027-4837-99c9-eecefeddc0a3"],
DrRecoveryRootFolder. Applied
change to temp map.
2020-05-28T13:19:59.772+10:00 info
vmware-dr[15648] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5
opID=733bc6:709f:ee8d:73e5]
RemovePlanInternal[9e997d70-8027-4837-99c9-eecefeddc0a3]: Plan destroyed
<<Note: Log dates and IDs have been changed to avoid any data integrity issues.>>
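The deletion entries above can be pulled out of a large log with a short script. The following is a minimal sketch, assuming the log text contains the same "HandleRemotePlanDeletion" and "RemovePlanInternal[...]: Plan destroyed" messages shown in the excerpt. The function name and the sample text are illustrative only, not part of any VMware tooling.

```python
import re

# Patterns taken from the log messages quoted in this post.
# \s* tolerates entries that were wrapped across lines.
REMOTE_DELETION = re.compile(
    r"HandleRemotePlanDeletion:\s*dr\.recovery\.RecoveryPlan:([0-9a-fA-F-]+)"
)
PLAN_DESTROYED = re.compile(
    r"RemovePlanInternal\[([0-9a-fA-F-]+)\]:\s*Plan destroyed"
)


def find_deleted_plans(log_text):
    """Return the plan IDs mentioned in deletion-related log entries."""
    return {
        "remote_deletion": sorted(set(REMOTE_DELETION.findall(log_text))),
        "destroyed": sorted(set(PLAN_DESTROYED.findall(log_text))),
    }


# Sample text copied from the (anonymized) excerpt above.
SAMPLE = """\
2020-05-30T13:19:59.345+10:00
verbose vmware-dr[15292] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5
opID=7339c6:79f:ee8d:735] HandleRemotePlanDeletion: dr.recovery.RecoveryPlan:9e997d70-807-4837-99c9-eecefeddc0a3
2020-05-28T13:19:59.772+10:00
verbose vmware-dr[15648] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5
opID=733b66:709f:ee8d:73e5] RemovePlanInternal[9e997d70-8027-4837-99c9-eecefeddc0a3]: Acquired DB context
2020-05-28T13:19:59.772+10:00 info
vmware-dr[15648] [SRM@6800 sub=RecoveryVMODL ctxID=50f9cfb5
opID=733bc6:709f:ee8d:73e5]
RemovePlanInternal[9e997d70-8027-4837-99c9-eecefeddc0a3]: Plan destroyed
"""

if __name__ == "__main__":
    for kind, plan_ids in find_deleted_plans(SAMPLE).items():
        print(kind, plan_ids)
```

To scan a real log, read the file contents into a string and pass it to find_deleted_plans. The count of IDs gives a quick idea of how many plans were affected.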
We had two options to recover the plans:
1). Recreate all the recovery plans manually, which is a time-consuming process.
2). Restore from an image-level backup.
Luckily, we had an image-level backup configured for our SRM server, and we were able to restore the server to its previous state, which saved a lot of manual effort.
Detailed Root Cause Analysis:
1). We checked the logs, as highlighted above in the issue details.
2). We suspected SRM SQL DB corruption, but after running an SRM DB health check we found that the database was fine and still contained the full count of recovery plans we had.
3). Finally, we went through the VMware SRM release notes (link below) and found the issue listed there. It is a known bug in earlier SRM versions that VMware fixed in a newer SRM release, 8.3.0.1.
As per the release notes, with simultaneous restarts of the SRM services/servers on both sites (Protected/DR), and with multiple sequenced restarts on a single site within less than a minute, some of the Site Recovery Manager plug-ins might time out during shutdown. As a result, the remote site might delete the recovery plans and history reports.
This is what we experienced.
Release notes link:
https://docs.vmware.com/en/Site-Recovery-Manager/8.3/rn/srm-releasenotes-8-3.html#resolvedissues
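Since the trigger described above is restarts landing too close together, one way to review an incident is to compare the restart times from both sites. The following is a minimal sketch, assuming the restart timestamps have already been collected from the "is starting" log entries in the ISO format shown earlier. The function name, the site labels, and every timestamp except the 13:17:23.439 one are hypothetical, and the one-minute window is simply the figure quoted from the release notes.

```python
from datetime import datetime, timedelta

# Window quoted from the release notes: restarts less than a minute apart.
RISK_WINDOW = timedelta(minutes=1)


def risky_restarts(restarts):
    """Find consecutive restarts (on any site) less than RISK_WINDOW apart.

    restarts: list of (site_label, iso_timestamp) tuples.
    Returns a list of (first_site, second_site, gap_in_seconds).
    """
    events = sorted((datetime.fromisoformat(ts), site) for site, ts in restarts)
    risky = []
    for (t1, s1), (t2, s2) in zip(events, events[1:]):
        gap = t2 - t1
        if gap < RISK_WINDOW:
            risky.append((s1, s2, gap.total_seconds()))
    return risky


# Hypothetical example data; only the 13:17:23.439 entry comes from the excerpt.
EXAMPLE_RESTARTS = [
    ("protected", "2020-05-30T13:17:00.000+10:00"),
    ("recovery", "2020-05-30T13:17:23.439+10:00"),
    ("recovery", "2020-05-30T13:25:00.000+10:00"),
]

if __name__ == "__main__":
    for first, second, gap in risky_restarts(EXAMPLE_RESTARTS):
        print(f"{first} -> {second}: restarts only {gap:.1f}s apart")
```

The same comparison can be used the other way round, as a reminder to leave a clear gap between restarting one site and the other on affected versions.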
Lessons Learnt: (This section is an important part of every troubleshooting exercise, and I would like to highlight it.)
· Always make sure logging is enabled, so that logs are available in time to troubleshoot the issue and understand the problem.
· Always back up your production servers.
· Always go through vendor release notes and technical guides to stay aware of known issues and bugs.
-xxx-