
Chris Hall

Principal Technical Consultant


Last time we recovered from the loss of our NSX Data Center primary site. If you’ve not seen that post, catch up now. It’s a great read. :wink:

As mentioned, this post is part 3 of a multipart series. Find the other parts here:

To recap, the NSX Data Center control plane components (the NSX Controller cluster and the Universal Distributed Logical Router (UDLR) control VMs) can only exist on one site: the primary site. If the primary site is lost, they must be recreated at a secondary site to reinstate the NSX control plane, which is exactly what we did in part 2 when we lost Site A.

Overview

The Lab

Site A Failed
As a refresher, here is where we are:

  • NSX Controller cluster has been rebuilt on Site B
  • Universal Site A UDLR control VM has been rebuilt on Site B
  • Universal Site B UDLR control VM has been rebuilt on Site B

Additionally, Site B NSX Manager is now our primary manager.

TL;DR - Process Overview

Too Lazy; Didn’t Read?
Again, here are the process steps for the TL;DR readers among us:

  1. Start Site A
  2. Check Site A NSX Manager Configuration and Status
  3. Demote Site A NSX Manager
  4. Delete Site A Controller Cluster
  5. Delete Site A UDLR Control VMs
  6. Assign Site A NSX Manager Secondary Role
  7. Verify Dynamic Routing Configuration of UDLRs and ESGs
  8. Test

Site A Start Up

OK, so Site A is back from the dead. Let’s get the site powered back on so we can first reinstate it as an NSX secondary site. From there we can promote it to primary again. Power on Site A in the following order:

  • ESXi Host(s)
  • vCenter Server
  • NSX Manager
  • Controller Cluster
  • Universal Distributed Logical Router (UDLR) Control VMs
  • Edge Services Gateway (ESG) VMs
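
If you keep a lab inventory in a script, the start-up order above can be captured in a few lines of Python. This is only a sketch: `POWER_ON_ORDER` mirrors the list above, while the inventory format and the VM names are made up for illustration and are not something NSX or vCenter exports.

```python
# Component types in the power-on order used in this post.
POWER_ON_ORDER = [
    "esxi-host",
    "vcenter",
    "nsx-manager",
    "controller",
    "udlr-control-vm",
    "esg",
]


def sort_for_power_on(inventory):
    """Return (name, kind) pairs sorted into power-on order.

    `inventory` is a list of (name, kind) tuples. This format is an
    assumption for the sketch, not a vCenter or NSX export format.
    """
    rank = {kind: position for position, kind in enumerate(POWER_ON_ORDER)}
    unknown = [name for name, kind in inventory if kind not in rank]
    if unknown:
        raise ValueError(f"Unknown component type for: {', '.join(unknown)}")
    # sorted() is stable, so items of the same kind keep their listed order.
    return sorted(inventory, key=lambda item: rank[item[1]])


if __name__ == "__main__":
    # Hypothetical lab inventory, deliberately out of order.
    lab = [
        ("esg-site-a", "esg"),
        ("nsx-site-a", "nsx-manager"),
        ("esxi-01", "esxi-host"),
        ("vc-site-a", "vcenter"),
    ]
    for name, kind in sort_for_power_on(lab):
        print(f"{kind:16} {name}")
```

Raising on an unknown component type is deliberate: a typo in the inventory should stop the run rather than quietly end up at the wrong place in the order.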

Check Site A NSX Manager Configuration and Status

Just as we did with Site B’s NSX Manager during failover in part 2, let’s confirm that Site A’s NSX Manager is happy and registered with vCenter. Access the Site A NSX Manager via a web browser (lab: https://nsx-site-a.lab), log in and navigate to View Summary. Confirm that the NSX Management Components are running:

NSX Management Components

Then navigate to Home - Manage vCenter Registration and confirm that the vCenter registration status shows as green:

vCenter Registration
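
If you would rather script a first reachability check before logging in, a minimal Python sketch follows. It only tests whether TCP port 443 answers on the lab host names used in this post. It says nothing about whether the NSX management services or the vCenter registration are healthy, so the checks in the UI above are still needed. The function name `https_port_open` is my own.

```python
import socket


def https_port_open(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Host names taken from the lab URLs in this post.
    for host in ("nsx-site-a.lab", "vc-site-a.lab"):
        state = "reachable" if https_port_open(host) else "NOT reachable"
        print(f"{host}:443 {state}")
```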

Demote Site A NSX Manager

Log onto Site A vCenter (lab: https://vc-site-a.lab/), navigate to Network and Security - Installation and Upgrade - Management - NSX Managers:

Two Primary Managers

As you can see, both Site A and Site B NSX Managers believe that they are the primary NSX Manager. Let’s take a closer look at the sync issue:

Sync Issue

Fair enough, we disconnected Site B NSX Manager from Site A NSX Manager during the failover.

Select Site A NSX Manager and select Actions - Remove Secondary Manager:

Remove Secondary Manager

Tick Perform Operation even if NSX Manager is inaccessible and click Remove:

Perform Removal

Next, select Site A NSX Manager again and select Actions - Remove Primary Role:

Remove Primary Role

Answer Yes to the warning (we’ll clean up our controllers in the next step):

Yes to Warning

Site A NSX Manager will then be placed into Transit mode:

Site A Manager in Transit Mode
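
To keep the role changes straight, here is a small Python model of what happens to the NSX Manager roles in this post: two managers both claiming primary, then Site A moving from primary to transit, and (in a later step) from transit to secondary. It is purely a mental model. The action names and the `ROLE_CHANGES` table are my own labels and do not correspond to any NSX API.

```python
# Role changes as described in this post. Illustrative only, not an NSX API.
ROLE_CHANGES = {
    ("primary", "remove_primary_role"): "transit",
    ("transit", "add_as_secondary"): "secondary",
}


def apply_action(role, action):
    """Return the new role, or raise if the post does not describe this change."""
    try:
        return ROLE_CHANGES[(role, action)]
    except KeyError:
        raise ValueError(f"No '{action}' step described for role '{role}'") from None


def managers_claiming_primary(roles):
    """Return the names of all managers whose role is 'primary'."""
    return [name for name, role in roles.items() if role == "primary"]


if __name__ == "__main__":
    # State right after Site A is powered back on.
    roles = {"nsx-site-a.lab": "primary", "nsx-site-b.lab": "primary"}
    print("Claiming primary:", managers_claiming_primary(roles))

    # Actions - Remove Primary Role on Site A.
    roles["nsx-site-a.lab"] = apply_action(roles["nsx-site-a.lab"], "remove_primary_role")
    print("After demotion:", roles)
    print("Claiming primary:", managers_claiming_primary(roles))
```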

Delete Site A Controller Cluster

Navigate to Network and Security - Installation and Upgrade - NSX Controller Nodes and select the NSX Manager in Transit (lab: NSX Manager 192.168.10.4). Select each controller in turn and select Delete, allowing time for deletion between each:

Delete Controllers

Upon deletion of the final controller, tick Proceed to Force Delete and click Delete:

Force Delete

Delete Site A UDLR Control VMs

Navigate to Network and Security - NSX Edges and select the NSX Manager in Transit (lab: NSX Manager 192.168.10.4). Select the first UDLR control VM listed and select Delete:

Delete UDLR

Confirm deletion by clicking Delete again.

Repeat for remaining UDLR control VMs, leaving only ESG(s) listed:

Just ESG remaining

Assign Site A NSX Manager Secondary Role

Navigate to Network and Security - Installation and Upgrade - Management - NSX Managers, select NSX Manager with primary role (lab: nsx-site-b.lab) and select Actions - Add Secondary Manager:

Add Secondary Manager

Complete the wizard, entering the Site A NSX Manager details, and click Add:

Add Secondary Manager Wizard

Accept the thumbprint and confirm that Site A NSX Manager is now listed as a Secondary Manager:

Secondary Manager Added

Navigate back to Network and Security - NSX Edges, select the NSX Manager with the newly assigned secondary role (lab: nsx-site-a.lab) and confirm that the UDLRs are listed again:

UDLRs are back

Verify Dynamic Routing Configuration of UDLRs and ESGs

In my test lab, I’m using BGP for my dynamic routing. Your environment may be using OSPF so modify the following commands to fit your circumstance.

Open a console to the Site A Edge VM and issue the command:

show ip bgp neighbors summary

Confirm that the Edge appliance shows an “E” (Established) status with all its configured neighbouring UDLRs (lab UDLRs: 192.168.100.15 and 192.168.200.15) and the upstream router (lab LABROUTER: 192.168.111.1):

BGP Established
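
With more than a handful of neighbours it is easy to miss one, so here is a hedged Python sketch that compares the states you read off the console against the neighbours you expect. It does not parse the Edge CLI output. The `observed` dictionary is assumed to be filled in by hand (or by your own tooling) from the status column, and the expected addresses are the lab values from this post.

```python
# Expected BGP neighbours for the Site A Edge in this lab.
EXPECTED_NEIGHBOURS = {
    "192.168.100.15": "UDLR",
    "192.168.200.15": "UDLR",
    "192.168.111.1": "LABROUTER",
}


def neighbours_not_established(observed, expected=EXPECTED_NEIGHBOURS):
    """Return expected neighbour IPs that are missing or not in state 'E'.

    `observed` maps neighbour IP -> status letter as read from the console.
    """
    return [ip for ip in expected if observed.get(ip) != "E"]


if __name__ == "__main__":
    # Statuses transcribed by hand from the Edge console.
    observed = {
        "192.168.100.15": "E",
        "192.168.200.15": "E",
        "192.168.111.1": "E",
    }
    problems = neighbours_not_established(observed)
    if problems:
        for ip in problems:
            print(f"{ip} ({EXPECTED_NEIGHBOURS[ip]}) is not Established")
    else:
        print("All expected neighbours are Established")
```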

Next, issue the command:

show ip route

Confirm that the edge is receiving routes from both the UDLRs and the upstream router:

Routes

Test

Finally, run some trace routes to confirm that traffic is following the correct path into the environment:

Trace Route In

and out of the environment:

Trace Route Out

Conclusion and Wrap Up

In this post we recovered our primary NSX Data Center site, Site A. With the steps detailed here, we demoted our back-from-the-dead Site A to secondary site status, regained our NSX control plane and proved correct traffic ingress to and egress from our previously dead site.

This was part 3 of a multipart series. Find the other parts here:

Next time, in part 4, we’ll look at promoting Site A back to primary status again.

Stay tuned..! :smiley:

-Chris