Last time we recovered from the loss of our NSX Data Center primary site. If you’ve not seen that post, catch up now. It’s a great read.
As mentioned, this post is part 3 of a multipart series. Find the other parts here:
- Part 1: Why and Getting Familiar
- Part 2: Bye-bye Site A!
- Part 3: This part - Site A Back from the Dead!
- Part 4: Making Site A Primary Again
To recap, the NSX Data Center control plane components (the NSX Controller cluster and the Universal Distributed Logical Router (UDLR) control VMs) can only exist on one site: the primary site. If the primary site is lost, these components must be recreated at a secondary site to reinstate the NSX control plane, which is exactly what we did at Site B in part 2.
Overview
The Lab
As a refresher, here is where we are:
- NSX Controller cluster has been rebuilt on Site B
- Universal Site A UDLR control VM has been rebuilt on Site B
- Universal Site B UDLR control VM has been rebuilt on Site B
Additionally, Site B NSX Manager is now our primary manager.
TL;DR - Process Overview
Too Lazy; Didn’t Read?
Again, here are the process steps for the TL;DRs among us:
- Start Site A
- Check Site A NSX Manager Configuration and Status
- Demote Site A NSX Manager
- Delete Site A Controller Cluster
- Delete Site A UDLR Control VMs
- Assign Site A NSX Manager Secondary Role
- Verify Dynamic Routing Configuration of UDLRs and ESGs
- Test
Site A Start Up
OK, so Site A is back from the dead. Let's get the site powered back on so we can first reinstate it as an NSX secondary site. From there we can promote it to primary again. Let's power on Site A in the following order:
- ESXi Host(s)
- vCenter Server
- NSX Manager
- Controller Cluster
- Universal Distributed Logical Router (UDLR) Control VMs
- Edge Services Gateway (ESG) VMs
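If you script your start-up, the power-on order above can be captured as a small sorting helper. This is only a sketch: the role labels and component names are hypothetical, and nothing here talks to vCenter or ESXi.

```python
# Power-on order taken from the list above; the role labels are hypothetical.
POWER_ON_ORDER = [
    "esxi-host",
    "vcenter",
    "nsx-manager",
    "controller",
    "udlr-control-vm",
    "esg",
]


def power_on_sequence(inventory):
    """Return component names sorted into the power-on order.

    inventory maps a component name to one of the role labels above.
    Raises ValueError for a role that is not in POWER_ON_ORDER.
    """
    rank = {role: position for position, role in enumerate(POWER_ON_ORDER)}
    unknown = sorted(name for name, role in inventory.items() if role not in rank)
    if unknown:
        raise ValueError(f"unknown role for: {', '.join(unknown)}")
    # Sort by role rank first, then by name so the result is deterministic.
    return sorted(inventory, key=lambda name: (rank[inventory[name]], name))


if __name__ == "__main__":
    lab = {
        "esg-site-a": "esg",
        "nsx-site-a": "nsx-manager",
        "vc-site-a": "vcenter",
        "esxi-01": "esxi-host",
    }
    for name in power_on_sequence(lab):
        print(name)
```

Keeping the order in a single list means a change to the sequence is a one-line edit rather than a rewrite of the script.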
Check Site-A NSX Manager Configuration and Status
Just as we did with Site B’s NSX Manager during failover in part 2, let’s confirm that Site A’s NSX Manager is happy and registered with vCenter. Access Site A NSX Manager via web browser (lab: https://nsx-site-a.lab), log in and navigate to View Summary. Confirm that the NSX Management Components are running:
and confirm vCenter Registration Home - Manage vCenter Registration shows as green:
Demote Site A NSX Manager
Log onto Site A vCenter (lab: https://vc-site-a.lab/), navigate to Network and Security - Installation and Upgrade - Management - NSX Managers:
As you can see, both Site A and Site B NSX Managers believe that they are the primary NSX Manager. Let's look closer at the sync issue:
Fair enough, we disconnected Site B NSX Manager from Site A NSX Manager during the failover.
Select Site A NSX Manager and select Actions - Remove Secondary Manager:
Tick Perform Operation even if NSX Manager is inaccessible and click Remove:
Next, select Site A NSX Manager again and select Actions - Remove Primary Role:
Answer Yes to the warning (we’ll clean up our controllers in the next step):
Site A NSX Manager will then be placed into Transit mode:
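If you'd rather confirm the role change outside the UI, here is a minimal sketch of parsing a role response. It assumes the NSX Manager REST API returns the role from `GET /api/2.0/universalsync/configuration/role` as a `<universalSyncRole>` XML document, and that the possible values are the four listed in the code. Both are assumptions on my part, so check them against the API guide for your NSX version. The sketch only parses a string and makes no network calls.

```python
import xml.etree.ElementTree as ET

# Assumed role values; check them against the API guide for your NSX version.
KNOWN_ROLES = {"STANDALONE", "PRIMARY", "SECONDARY", "TRANSIT"}


def parse_role(xml_text):
    """Extract the role string from an assumed <universalSyncRole> document."""
    root = ET.fromstring(xml_text)
    role = root.findtext("role")
    if role is None:
        raise ValueError("no <role> element found")
    role = role.strip().upper()
    if role not in KNOWN_ROLES:
        raise ValueError(f"unexpected role: {role}")
    return role


def ready_for_cleanup(xml_text):
    """True once the manager reports Transit, i.e. the demotion has taken effect."""
    return parse_role(xml_text) == "TRANSIT"


if __name__ == "__main__":
    # Hypothetical response body, shaped as assumed above.
    sample = "<universalSyncRole><role>TRANSIT</role></universalSyncRole>"
    print(parse_role(sample), ready_for_cleanup(sample))
```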
Delete Site A Controller Cluster
Navigate to Network and Security - Installation and Upgrade - NSX Controller Nodes and select the NSX Manager in Transit (lab: NSX Manager 192.168.10.4). Select each controller in turn and select Delete, allowing time for deletion between each:
Upon deletion of the final controller, tick Proceed to Force Delete and click Delete:
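The delete-one-at-a-time, force-the-last-one pattern above can be sketched as a small planning helper. The controller IDs are hypothetical, and the request path with its `forceRemoval` parameter is my assumption about the NSX REST API rather than something taken from this lab, so verify it before use. The code only builds a plan and sends nothing.

```python
def controller_delete_plan(controller_ids):
    """Return (controller_id, force) pairs in deletion order.

    Only the final controller is flagged for force deletion, mirroring the
    Proceed to Force Delete tick box in the UI.
    """
    ids = list(controller_ids)
    last = len(ids) - 1
    return [(cid, index == last) for index, cid in enumerate(ids)]


def delete_path(controller_id, force):
    """Build the request path; the endpoint and parameter are assumptions."""
    path = f"/api/2.0/vdn/controller/{controller_id}"
    return f"{path}?forceRemoval=true" if force else path


if __name__ == "__main__":
    # Hypothetical controller IDs.
    for cid, force in controller_delete_plan(["controller-1", "controller-2", "controller-3"]):
        print("DELETE", delete_path(cid, force))
```

If you do automate this, remember to wait for each deletion to finish before issuing the next, just as in the UI.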
Delete Site A UDLR Control VMs
Navigate to Network and Security - NSX Edges and select the NSX Manager in Transit (lab: NSX Manager 192.168.10.4). Select first UDLR VM listed and select Delete:
Confirm deletion by clicking Delete again.
Repeat for remaining UDLR control VMs, leaving only ESG(s) listed:
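Before deleting anything, it can help to list which edges will go and which will stay. The sketch below splits a list of edge summaries into UDLRs to delete and ESGs to keep. The edge names are hypothetical, and the type strings `distributedRouter` and `gatewayServices` are what I'd expect the NSX API to report, so treat them as assumptions.

```python
def split_edges(edges):
    """Split edge summaries into (udlrs_to_delete, esgs_to_keep) name lists.

    Each edge is a dict with "name" and "type" keys. The type strings are
    assumed values; anything else raises ValueError rather than being guessed.
    """
    udlrs, esgs = [], []
    for edge in edges:
        if edge["type"] == "distributedRouter":
            udlrs.append(edge["name"])
        elif edge["type"] == "gatewayServices":
            esgs.append(edge["name"])
        else:
            raise ValueError(f"unknown edge type for {edge['name']}: {edge['type']}")
    return udlrs, esgs


if __name__ == "__main__":
    # Hypothetical inventory as seen by the NSX Manager in Transit.
    inventory = [
        {"name": "udlr-site-a", "type": "distributedRouter"},
        {"name": "udlr-site-b", "type": "distributedRouter"},
        {"name": "esg-site-a", "type": "gatewayServices"},
    ]
    to_delete, to_keep = split_edges(inventory)
    print("delete:", to_delete)
    print("keep:", to_keep)
```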
Assign Site A NSX Manager Secondary Role
Navigate to Network and Security - Installation and Upgrade - Management - NSX Managers, select NSX Manager with primary role (lab: nsx-site-b.lab) and select Actions - Add Secondary Manager:
Complete wizard entering Site A NSX Manager details and click Add:
Accept thumbprint and confirm that Site A NSX Manager is now listed as a Secondary Manager:
Navigate back to Network and Security - NSX Edges, select the NSX Manager with the newly assigned secondary role (lab: nsx-site-a.lab) and confirm that the UDLRs are again listed:
Verify Dynamic Routing Configuration of UDLRs and ESGs
In my test lab, I’m using BGP for my dynamic routing. Your environment may be using OSPF so modify the following commands to fit your circumstance.
Open a console to the Site A Edge VM and issue the command:
Confirm that the Edge appliance shows an “E” (Established) status with all its configured neighbouring UDLRs (lab UDLRs: 192.168.100.15 and 192.168.200.15) and the upstream router (lab LABROUTER: 192.168.111.1):
Next, issue the command:
Confirm that the edge is receiving routes from both the UDLRs and the upstream router:
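The neighbour check above boils down to comparing what you expect with what the Edge reports. Here is a minimal sketch using the lab neighbour addresses from this post. It assumes you have already transcribed the status column from the Edge console into a dict by hand. The function name and the dict shape are hypothetical, and it does not parse CLI output.

```python
# Lab neighbours from this post: two UDLRs and the upstream router.
EXPECTED_NEIGHBOURS = ["192.168.100.15", "192.168.200.15", "192.168.111.1"]


def neighbours_not_established(observed, expected=EXPECTED_NEIGHBOURS):
    """Return expected neighbours whose status is missing or not "E".

    observed maps a neighbour IP to the status letter seen on the Edge.
    """
    return [ip for ip in expected if observed.get(ip) != "E"]


if __name__ == "__main__":
    # Hypothetical transcription of the Edge console output.
    seen = {
        "192.168.100.15": "E",
        "192.168.200.15": "E",
        "192.168.111.1": "E",
    }
    problems = neighbours_not_established(seen)
    print("all established" if not problems else f"check: {problems}")
```

An empty result means every expected neighbour is Established; anything else is the list of peers to investigate.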
Test
Finally, run some trace routes to confirm that traffic is following the correct path into the environment:
and out of the environment:
Conclusion and Wrap Up
In this post we recovered our primary NSX Data Center site, Site A. With the steps detailed here, we demoted our back-from-the-dead Site A to secondary site status, regained our NSX control plane and proved correct traffic ingress/egress to and from our previously dead site.
Next time, in part 4, we’ll look at promoting Site A back to primary status again.
Stay tuned..!
-Chris