Thursday, April 5, 2012

ESXi Reboot Causes ARP flood

We recently encountered a very strange issue where rebooting one of our ESXi 5.0 hosts caused network disruption (NTI Enviromux devices would drop off the network, and older switches would experience packet loss and VLAN reconfiguration flapping).

Eventually, a packet capture showed that the host was flooding the subnet with ARP requests related to one of its NFS datastores.  When we removed this datastore, the host could reboot without disrupting the network.  However, we needed this datastore to satisfy the HA requirement for two heartbeat datastores.

The datastores are all provisioned from NetApp, with the exports file listing each host's IP address for RW and ROOT access.  Upon closer inspection, the problem host's IP was listed only in the RW section of the NetApp exports file.  So, while the datastore showed up as mounted, ESXi was not given the full access it needed for HA heartbeat functionality, and the result was this flood of ARP requests.
Once the exports file was updated with the host's IP in the ROOT section, the datastore remounted with full permissions and the host stopped producing the ARP flood and the resulting network disruption.
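
As a rough illustration of the check that would have caught this, here is a minimal Python sketch that looks for hosts listed under rw= but missing from root= on a single exports line. It assumes the 7-Mode style of line quoted in the comments below (comma-separated options, colon-separated host lists). The function names and the sample addresses are hypothetical, and it compares entries as literal strings only, so it does not understand subnets or netgroups.

```python
def parse_export_line(line):
    """Split one exports line into (path, {option: [hosts]}).

    Assumes the form: /vol/name -opt=a:b,opt2=c (an assumption, not a full parser).
    """
    path, _, opts = line.strip().partition(" -")
    options = {}
    for opt in opts.split(","):
        key, _, value = opt.partition("=")
        options[key] = value.split(":") if value else []
    return path, options


def hosts_missing_root(line):
    """Return hosts that appear in rw= but not in root= (literal string match)."""
    _, options = parse_export_line(line)
    root_hosts = set(options.get("root", []))
    return [host for host in options.get("rw", []) if host not in root_hosts]


# Hypothetical example line: the second host has RW access but no ROOT access.
example = "/vol/vmapp -sec=sys,rw=192.0.2.11:192.0.2.12,root=192.0.2.11"
print(hosts_missing_root(example))
```

Any host this reports would be in the same state as our problem host: able to mount the datastore, but without root access.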

We still have a case open with VMware to determine why ESXi 5.0 Update 1 behaves this way (allowing an NFS mount without the access required for full functionality, plus unleashing an ARP flood).


StockManiac2008 said...

Wow, thank you for this post. We have been racking our brains over the same issue, which caused network disruption whenever an ESX server was rebooted. However, we are seeing this on ESX 4.0 systems, so it is not limited to ESXi 5.
In our case the issue was a typo in the NFS exports:
/vol/vmapp -sec=sys,rw=xx.xx.x.1/24,root=xx.xx.x.1/24
instead of
/vol/vmapp -sec=sys,rw=xx.xx.x.0/24,root=xx.xx.x.0/24
The last octet had a 1 instead of 0.
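
A typo like this can be flagged mechanically. The following is a small sketch using Python's standard ipaddress module. It treats an rw=/root= entry as CIDR notation and reports entries whose host bits are set. The function name and the documentation-range addresses are made up for illustration, and how the filer itself interprets such an entry is not something this check tells you.

```python
import ipaddress


def suggest_network(entry):
    """Return the corrected network if `entry` has host bits set, else None.

    Raises ValueError if `entry` is not valid CIDR notation at all.
    """
    try:
        # strict=True rejects addresses like x.x.x.1/24 (host bits set).
        ipaddress.ip_network(entry, strict=True)
        return None
    except ValueError:
        # strict=False masks the host bits and yields the intended network.
        return str(ipaddress.ip_network(entry, strict=False))


print(suggest_network("192.0.2.1/24"))  # mistyped entry, suggests 192.0.2.0/24
print(suggest_network("192.0.2.0/24"))  # already a proper network address
```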


StockManiac2008 said...

Oh, wanted to add that this only happens to us when Spanning Tree is turned ON on the ESX hosts' switch ports. With Spanning Tree OFF, even with the typo, we did not see the issue on the switch.

vExpert2011 said...

@StockManiac glad you resolved it. Yes, we have Spanning Tree enabled. Of course we should configure NFS exports correctly, but VMware should fix this with a patch.

StockManiac2008 said...

Doesn't best practice say Spanning Tree should be turned OFF for ESX hosts and NetApp filers?
We have had other issues, like long delays during filer failover/takeover with ST turned ON.


PTCruiserGT said...

I know it's been a while, but I'm guessing your case with VMware went nowhere. After 2 years, they finally published an article for the issue, but they aren't admitting to a problem on their end.