Overview
Question: When is the worst time to find the NSX Distributed Firewall System Excluded VM list is empty?
Hint:
Yep that’s it. When running NSX manager VM(s) on a cluster prepared for NSX and then setting the default NSX distributed firewall rules to Drop or Reject.
And that’s when all hell doesn’t break loose. Quite the opposite in fact.
Oh F\/£k !!!!!1!!
Yeah… a whole lot of nothing going on.
Yep. My NSX Manager is completely firewalled from the rest of my network and for all intents and purposes offline.
Choice words were spoken… Very choice words indeed…
Background
Whilst writing an upcoming post on deploying NSX using the vCenter plug-in, I had my vSphere 8.0 and NSX 4.0.1.1 environment all in and spinning nicely. The final thing to talk about was the very bottom three rules of the Application category (that is, the very last three rules to be evaluated by the distributed firewall), which are set by default to allow any traffic from any source to any destination.
Yep, just set those to drop (or reject) and publish.
Aaaand that is where we find ourselves; a firewalled NSX manager that we are unable to get at via the network to turn off said firewall.
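To see why that one change is enough to cause a lockout, here is a minimal Python sketch of first-match rule evaluation. It is a toy model under stated assumptions, not how NSX implements the distributed firewall: the `Rule` type, the `evaluate` and `build_rules` functions, the rule names and all IP addresses are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    name: str
    source: Optional[str]       # None means "any"
    destination: Optional[str]  # None means "any"
    action: str                 # "ALLOW", "DROP" or "REJECT"


def evaluate(rules: List[Rule], src: str, dst: str) -> str:
    """Walk the rules top-down and return the action of the first match."""
    for rule in rules:
        if rule.source not in (None, src):
            continue
        if rule.destination not in (None, dst):
            continue
        return rule.action
    raise ValueError("no rule matched; the table needs an any/any default rule")


# Hypothetical addresses, for illustration only.
ADMIN_PC = "192.168.10.50"
NSX_MANAGER = "192.168.10.20"


def build_rules(default_action: str) -> List[Rule]:
    """One specific rule on top, the any/any default rule at the very bottom."""
    return [
        Rule("web-to-db", "10.0.1.10", "10.0.2.10", "ALLOW"),
        Rule("default-layer3", None, None, default_action),
    ]


print(evaluate(build_rules("ALLOW"), ADMIN_PC, NSX_MANAGER))  # ALLOW
print(evaluate(build_rules("DROP"), ADMIN_PC, NSX_MANAGER))   # DROP
```

Nothing above the default rule mentions the NSX Manager, so its management traffic only ever matches the bottom any/any rule. Flip that rule to drop and the manager goes dark along with everything else.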
Recovery
So how do we recover from this situation?
How can we disable the NSX distributed firewall without using the NSX web GUI?
From the NSX command line interface (CLI) perhaps?
What is the “magic” command to allow access again?
After much googling and CLI bashing, I discovered this command, to be run on the ESXi server hosting my NSX Manager VM (this is a lab so I only have one manager; production environments should have three), to recover access to NSX Manager again:
vsipioctl clearallfilters
Which, when run on the ESXi host in question, looks like this (I suggest reading the discussion points below BEFORE running):
I was then able to access my NSX Manager and change my default rules back.
For good measure after recovery I rebooted my ESXi host too.
Recovery - Discussion Points
Some points to consider:
- Migrate (vMotion) all VMs other than the affected NSX Manager VM(s) away from the ESXi host that the command is run on. This command will unilaterally remove ALL firewall rules from ALL VMs on the ESXi host it is run on.
- Searching for the vsipioctl clearallfilters command in the VMware Knowledge Base results in just one hit: vNICs are disconnected after NSX is uninstalled from ESXi (74556). At the time of writing this post, the related products section of this KB article lists only NSX for vSphere (aka NSX-v), not NSX-T (or NSX, as it is now known).
- After rebooting the ESXi host, you may find that firewall rules are still not being applied to VMs running on the host. To resolve this, disconnect, save, reconnect and save each network connection for each VM affected:
  Temporarily moving VMs to a different VDS port group and back may help too. Anything to get the ESXi host to rebuild its dvfilter table.
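Because the command clears filters for every VM on the host, it can help to work out up front which VMs need to move. The following is a small Python sketch of that pre-check. The function name `vms_to_migrate` and the VM names are hypothetical, and the inventory list is assumed to have been gathered by hand or from whatever tooling you already use.

```python
from typing import Iterable, List


def vms_to_migrate(vms_on_host: Iterable[str], manager_vms: Iterable[str]) -> List[str]:
    """Return the VMs that should be vMotioned away before clearing filters.

    Everything on the host except the locked-out NSX Manager VM(s) is listed,
    because the clear command does not discriminate between VMs.
    """
    keep = {name.lower() for name in manager_vms}
    return sorted(vm for vm in vms_on_host if vm.lower() not in keep)


# Hypothetical inventory for one ESXi host.
on_host = ["nsx-manager-01", "web-01", "db-01"]
print(vms_to_migrate(on_host, ["nsx-manager-01"]))  # ['db-01', 'web-01']
```

An empty result means only the NSX Manager VM(s) remain on the host, which is the state you want before running the command.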
Making Sure This Does Not Happen Again
How can we make sure that this doesn’t happen again?
By making use of the user configured Distributed Firewall Exclusion List:
Whilst we are here, a quick look in the System Excluded VMs list shows nothing going on, as suspected and experienced:
Back to User Excluded Groups, let’s add, name and save our DFW-Excluded group:
Navigate to Inventory > Groups and edit the newly created DFW-Excluded group. Let’s set some members:
I’ll add my NSX Manager by VM name:
As it’s external to the NSX prepared environment (it’s running on another ESXi host), I’ll add my vCenter server by IP as well for good measure:
Save (twice) and let's look at the members of our DFW-Excluded group. First VMs:
Looks good. Checking IP addresses, we can see both the IPs of our vCenter and our NSX Manager listed (Remember production environments should have three NSX Managers deployed - this is just a lab):
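For anyone who would rather script the same group and exclusion list than click through the UI, here is a hedged Python sketch that only builds the request bodies and prints them. It makes no network calls. The field names and the two paths in the comments are assumptions based on the NSX Policy API as commonly documented, so check them against the API reference for your version before sending anything. The VM name and IP address are hypothetical.

```python
import json
from typing import Iterable, List

GROUP_ID = "DFW-Excluded"
# Assumed Policy API path of the group in the default domain.
GROUP_PATH = f"/infra/domains/default/groups/{GROUP_ID}"


def build_group(vm_name: str, ip_addresses: Iterable[str]) -> dict:
    """Body for the group: one VM matched by name OR a list of static IPs.

    Assumed target: PATCH /policy/api/v1/infra/domains/default/groups/DFW-Excluded
    """
    return {
        "display_name": GROUP_ID,
        "expression": [
            {
                "resource_type": "Condition",
                "member_type": "VirtualMachine",
                "key": "Name",
                "operator": "EQUALS",
                "value": vm_name,
            },
            {"resource_type": "ConjunctionOperator", "conjunction_operator": "OR"},
            {"resource_type": "IPAddressExpression", "ip_addresses": list(ip_addresses)},
        ],
    }


def build_exclude_list(group_paths: Iterable[str]) -> dict:
    """Body for the user exclusion list.

    Assumed target: PATCH /policy/api/v1/infra/settings/firewall/security/exclude-list
    """
    members: List[str] = list(group_paths)
    return {"members": members}


# Hypothetical lab values: NSX Manager by VM name, vCenter by IP.
print(json.dumps(build_group("nsx-manager-01", ["192.0.2.10"]), indent=2))
print(json.dumps(build_exclude_list([GROUP_PATH]), indent=2))
```

Sending the exclusion list body would most likely replace the existing member list rather than append to it, so read the current list first if other groups are already excluded.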
Making Sure This Does Not Happen Again - Testing
Let’s pop in a distributed firewall rule to drop ping from any source to any destination:
Yep, we can still ping our NSX Manager:
Finally let’s set the default rules to Drop again, and publish:
A refresh of the browser to check that the vCenter NSX plug-in reloads…
Yep we are all good! Phew!
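A browser refresh works, but a scripted check is handy if you want to repeat the test after every publish. Below is a minimal Python sketch that attempts a TCP connection to the NSX Manager's HTTPS port. The function name `is_reachable` and the default port of 443 are assumptions for illustration. A successful connect only shows that the TCP path is open, not that the UI or API is healthy.

```python
import socket


def is_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run it from a machine outside the NSX prepared cluster, passing your own NSX Manager address, once before and once after setting the default rules to Drop. With the exclusion group in place, both runs should return True.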
Conclusion and Wrap Up
Well, a post that I did not think I would need to write. That said, a good post to have in the back pocket should I/you need to stop the NSX distributed firewall in a hurry in the future.
Certainly reading through the Manage a Firewall Exclusion List section of the NSX documentation on the VMware website:
NSX has system excluded virtual machines, and user excluded groups. NSX Manager and NSX Edge node virtual machines are automatically added to the read-only the System Excluded VMs list.
Not in an NSX 4.0.1.1 vSphere 8.0 plug-in environment they are not!
Finally as perhaps proof of a small world, take a look at the third command listed towards the bottom of this post: NSX-T Nested ESXi Host Preparation Failed or Timed Out.
It would seem that I had the answer all along, yet I didn’t know it.
…story of my life…
-Chris