Chris Hall

Last time we recovered from the loss of our NSX Data Center primary site. If you’ve not seen that post, catch up now. It’s a great read. :wink:

As mentioned, this post is part 4 of a multipart series. Find the other parts here:

To recap, the NSX Data Center control plane components (the NSX Controller cluster and the Universal Distributed Logical Router (UDLR) control VMs) can only exist on one site: the primary site. If the primary site is lost, they must be recreated at a secondary site to reinstate the NSX control plane, which is exactly what we did last time. In this post we will promote Site A back to being our primary site.

Overview

The Lab

Site A Back
As a refresher, here is where we are:

  • NSX Controller cluster has been rebuilt on Site B
  • Universal Site A UDLR control VM has been rebuilt on Site B
  • Universal Site B UDLR control VM has been rebuilt on Site B

Additionally, Site B NSX Manager is now our primary manager.

We need to make Site A our primary site again.

TL;DR - Process Overview

Too Lazy, Didn’t Read?
Yep, still got you covered:

  1. Remove Primary Role and Assign to Site A NSX Manager
  2. Deploy Site A controller cluster
  3. Deploy Site A UDLR control VMs
  4. Delete Site B controller cluster
  5. Assign Secondary role to Site B NSX Manager
  6. Confirm Site A UDLR controller VM clean up
  7. Verify configuration of the UDLRs
  8. Verify dynamic routing configuration of the UDLRs and ESGs
  9. Test
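The order of these steps matters (for example, the Site A controllers go in before the Site B ones come out), so here is a minimal Python sketch of the list above as an ordered checklist. The step identifiers and the `next_step` helper are my own invention for illustration, not anything NSX provides.

```python
# Hypothetical step identifiers mirroring the numbered list above.
FAILBACK_STEPS = [
    "move_primary_role_to_site_a",
    "deploy_site_a_controllers",
    "deploy_site_a_udlr_control_vms",
    "delete_site_b_controllers",
    "assign_secondary_role_to_site_b",
    "confirm_control_vm_cleanup",
    "verify_udlr_configuration",
    "verify_dynamic_routing",
    "test",
]


def next_step(completed):
    """Return the next step to run, or None when everything is done.

    Raises ValueError if a step is unknown or was completed out of order.
    """
    done = set(completed)
    unknown = done - set(FAILBACK_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    for index, step in enumerate(FAILBACK_STEPS):
        if step not in done:
            # Anything completed after the first gap was done too early.
            early = [s for s in FAILBACK_STEPS[index + 1:] if s in done]
            if early:
                raise ValueError(f"{early[0]} was completed before {step}")
            return step
    return None


print(next_step(FAILBACK_STEPS[:3]))
```

With the first three steps ticked off, the helper points at deleting the Site B controller cluster, which is where the walkthrough below picks up after the Site A deployments.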

Remove Primary Role and Assign to Site A NSX Manager

Log onto Site A vCenter (lab: https://vc-site-a.lab/), navigate to Network and Security - Installation and Upgrade - Management - NSX Managers, select primary NSX Manager, click Actions - Remove Primary Role:

Remove Primary Role

Answer Yes to continue.

Once complete, both NSX Managers will be placed into transit mode:

Transit Mode

Select Site A NSX Manager and click Actions - Assign Primary Role:

Assign Primary Role
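If it helps to picture the role changes, here is a rough Python sketch of the two actions above as a tiny state model. The manager names and the starting role of Site A are assumptions for the lab; the transitions simply follow what this post describes and are not a model of NSX internals.

```python
# Role names follow the post; manager names are made up for the lab.
def remove_primary_role(roles):
    """Demote the primary; per the post, every manager ends up in transit."""
    if "primary" not in roles.values():
        raise ValueError("no primary NSX Manager to demote")
    return {manager: "transit" for manager in roles}


def assign_primary_role(roles, manager):
    """Promote one transit manager, refusing if a primary already exists."""
    if "primary" in roles.values():
        raise ValueError("a primary NSX Manager already exists")
    if roles.get(manager) != "transit":
        raise ValueError(f"{manager} is not in transit mode")
    updated = dict(roles)
    updated[manager] = "primary"
    return updated


# Assumed starting point: Site B holds the primary role after the failover.
roles = {"nsx-site-a": "secondary", "nsx-site-b": "primary"}
roles = remove_primary_role(roles)
roles = assign_primary_role(roles, "nsx-site-a")
print(roles)
```

Site B stays in transit at this point; it only becomes a secondary later in the process.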

Deploy Site A Controller Cluster

Navigate to Network and Security - Installation and Upgrade - Management - NSX Controller Nodes. Confirm that Primary (Site A) NSX Manager is selected, confirm common controller attributes and click Add to deploy the first Site A controller:

Create Controller

Once the deployment of the first controller is complete and the controller shows as Connected, repeat the process twice more to deploy two more controllers.

Once all three controllers have been deployed, confirm that they have correctly peered:

Controllers Peered
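The "one at a time, wait for Connected" rule above can be sketched in a few lines of Python. The dictionaries are hypothetical records I would fill in by hand from the UI; the field names are assumptions, not an NSX API response format.

```python
CLUSTER_SIZE = 3  # the post deploys three controllers in total


def all_connected(controllers):
    """True when every controller record shows the Connected status."""
    return all(c["status"] == "Connected" for c in controllers)


def can_deploy_next(controllers):
    """Another controller may be added only if the cluster is not yet full
    and every controller deployed so far is Connected."""
    return len(controllers) < CLUSTER_SIZE and all_connected(controllers)


def cluster_ready(controllers):
    """The cluster is complete when all three controllers are Connected."""
    return len(controllers) == CLUSTER_SIZE and all_connected(controllers)


# Hypothetical snapshot: first controller up, second still deploying.
snapshot = [
    {"name": "controller-1", "status": "Connected"},
    {"name": "controller-2", "status": "Deploying"},
]
print(can_deploy_next(snapshot), cluster_ready(snapshot))
```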

Deploy Primary Site UDLR Control VMs

Navigate to Network and Security - NSX Edges, confirm that Primary (Site A) NSX Manager is selected and select one of the previously deployed UDLRs. From there, select Configure - Appliance Settings - Add Edge Appliance VM and complete the wizard:

UDLR Deployment

Once deployment completes, repeat for the remaining UDLRs in the environment until the deployment status for all primary site Edges is Deployed:

Primary Edges

Delete Site B Controller Cluster

Navigate to Network and Security - Installation and Upgrade - Management - NSX Controller Nodes and confirm that Transit (Site B) NSX Manager is selected. Select each controller in turn and select Delete, allowing time for deletion between each:

Delete Controllers

Upon deletion of the final controller, tick Proceed to Force Delete and click Delete.
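As a small illustration of the deletion order, the Python sketch below builds a removal plan in which only the last controller is force deleted. The controller IDs are invented for the example, and the function only produces a plan; it does not talk to NSX.

```python
def deletion_plan(controller_ids):
    """Return (controller_id, force) pairs in deletion order.

    Only the final controller is flagged for a forced delete, matching
    the Proceed to Force Delete tick box described above.
    """
    last = len(controller_ids) - 1
    return [(cid, index == last) for index, cid in enumerate(controller_ids)]


# Hypothetical IDs for the three temporary Site B controllers.
for controller_id, force in deletion_plan(["controller-4", "controller-5", "controller-6"]):
    print(controller_id, "force delete" if force else "normal delete")
```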

Assign Secondary Role to Site B NSX Manager

Navigate to Network and Security - Installation and Upgrade - Management - NSX Managers, select primary NSX Manager, click Actions - Add Secondary Manager:

Add Secondary Manager

Complete wizard and click Add:

Complete Secondary Manager Wizard

Accept thumbprint and confirm that Site B NSX Manager is now listed as a Secondary Manager:

New Secondary Manager

Confirm Site A UDLR Controller VM Clean Up

Navigate to Network and Security - NSX Edges, confirm that Secondary (Site B) NSX Manager is selected and confirm status of UDLRs is listed as Active instead of Deployed:

UDLRs Active
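To keep track of which status to expect where, here is a minimal Python sketch based on this post: UDLRs show Deployed under the primary manager and Active under the secondary. The UDLR names are made up, and the statuses are ones you would copy from the NSX Edges view by hand.

```python
# Expected UDLR status per NSX Manager role, as described in this post.
EXPECTED_STATUS = {"primary": "Deployed", "secondary": "Active"}


def unexpected_udlrs(manager_role, udlr_status):
    """Return the UDLRs whose status differs from what the role should show."""
    expected = EXPECTED_STATUS[manager_role]
    return {name: status for name, status in udlr_status.items() if status != expected}


# Hypothetical view from the Site B (secondary) manager mid clean up.
site_b_view = {"udlr-site-a": "Active", "udlr-site-b": "Deployed"}
print(unexpected_udlrs("secondary", site_b_view))
```

An empty result means the view matches what the post expects for that manager.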

Finally, confirm that the temporary UDLR control VMs have been deleted from the secondary site:

Inventory

Verify Configuration of the UDLRs

Navigate to Network and Security - NSX Edges in the primary site and select one of the UDLRs. Select Configure - Interfaces and confirm that connectivity is as expected:

UDLR Interfaces

Confirm other settings such as firewall, DHCP Relay (if configured), etc.

Repeat the verification checks on the remaining UDLRs in the environment.

Verify Dynamic Routing Configuration of UDLRs and ESGs

In my test lab, I’m using BGP for my dynamic routing. Your environment may be using OSPF so modify the following commands to fit your circumstance.

Open a console to both of the Edge VMs in turn and issue the command:

show ip bgp neighbors summary

Confirm that the Site A Edge appliance shows an “E” (Established) status with all its configured neighbouring UDLRs (lab UDLRs: 192.168.100.15 and 192.168.200.15) and the upstream router (lab LABROUTER: 192.168.111.1):

ESG A Established

Confirm that the Site B Edge appliance shows an “E” (Established) status with all its configured neighbouring UDLRs (lab UDLRs: 192.168.100.15 and 192.168.200.15) and the upstream router (lab LABROUTER: 192.168.222.1):

ESG B Established
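If you are checking more than a couple of Edges, the neighbour checks above can be written down as a small Python comparison. The neighbour addresses are the lab values from this post; the Edge names and the `observed` dictionary (neighbour IP to state letter, transcribed by hand from the console) are assumptions, as I am not parsing real CLI output here.

```python
# Lab neighbours from this post: two UDLRs plus the site's upstream router.
EXPECTED_NEIGHBOURS = {
    "esg-site-a": {"192.168.100.15", "192.168.200.15", "192.168.111.1"},
    "esg-site-b": {"192.168.100.15", "192.168.200.15", "192.168.222.1"},
}


def neighbours_not_established(edge, observed):
    """Return expected neighbours that are missing or not in state 'E'."""
    return sorted(
        ip for ip in EXPECTED_NEIGHBOURS[edge] if observed.get(ip) != "E"
    )


# Hypothetical transcription from the Site A Edge console.
observed_site_a = {
    "192.168.100.15": "E",
    "192.168.200.15": "A",  # any letter other than E counts as not established
}
print(neighbours_not_established("esg-site-a", observed_site_a))
```

An empty list means every expected peering is Established.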

Next, issue the command:

show ip route

Confirm that both Edges are receiving routes from both the UDLRs and the upstream router:

Site A ESG Routes

Site B ESG Routes
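The same idea works for the routing table: check that at least one route is learned from each expected next hop. In this Python sketch the next hops are the lab addresses from this post, while the prefixes are invented purely for illustration and the route list is assumed to be transcribed by hand from the console output.

```python
def missing_route_sources(routes, expected_next_hops):
    """Return expected next hops from which no route has been learned.

    routes is a list of (prefix, next_hop) pairs.
    """
    seen = {next_hop for _prefix, next_hop in routes}
    return sorted(set(expected_next_hops) - seen)


# Hypothetical Site A Edge routes; the prefixes are made up.
site_a_routes = [
    ("0.0.0.0/0", "192.168.111.1"),
    ("10.0.1.0/24", "192.168.100.15"),
]
site_a_sources = ["192.168.100.15", "192.168.200.15", "192.168.111.1"]
print(missing_route_sources(site_a_routes, site_a_sources))
```

Here the second UDLR would be flagged, as nothing has been learned from it yet.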

Test

Finally, run some trace routes to confirm that traffic is following the correct path into the environment:

Traffic Ingress

and out of the environment. Site A:

Site A Traffic Egress

Site B:

Site B Traffic Egress

Conclusion and Wrap Up

So there we have it: the complete failure of an NSX Data Center primary site, the promotion of the secondary site to primary status and the recovery of the control plane in the remaining site. Then, once the failed site came back online, the re-promotion of that site to primary status and the clean up of the temporary control plane in the newly demoted site.

Phew! That’s it for this multipart series. Hope you enjoyed it. Remember to keep a link to this series safe. You never know when you may need it!

Find the other parts here:

All in all, a bit of a mission this one, but well worth it should disaster ever strike. :grimacing:

-Chris