Friday, April 15, 2011

vMotion Unicast Flood ESXi

Our VMware management and vMotion NICs share the same IP space.
This is not VMware best practice: they recommend that VMkernel/vMotion traffic be isolated in its own IP space.

Problem:

While running ESX 4.1 we had no issues with the shared IP space, but once we started migrating to ESXi we saw more and more network disruption during vMotion (especially bulk vMotions of 10-20 VMs ahead of ESXi maintenance): switches reported high collisions and drops, and load on the Juniper firewall would spike. Forum threads revealed others experiencing the same issue.

Solution:

The issue is resolved by adding a new vMotion vNIC in a private IP space.
This is a best practice recommendation I had previously believed would require a network topology redesign with downtime.
But since vMotion traffic does not route outside the cluster, VMware support was able to demonstrate that it is as simple as adding a new VMkernel vNIC dedicated to vMotion (vmk3 in the listings below). The routing table before the change:

~ # esxcfg-route -l
VMkernel Routes:
Network            Netmask            Gateway          Interface
169.4.5.0          255.255.255.224    Local Subnet     vmk2
17.5.5.0           255.255.255.0      Local Subnet     vmk0
default            0.0.0.0            17.5.5.1         vmk0
Add a new VMkernel port (in the VI client: Select your ESXi host->Configuration Tab->Networking->vSwitch Properties->Add->VMkernel):



Choose a private IP space - it does not even need to match the existing default gateway setting since, according to VMware support, the vMotion traffic is not routed anyway. (I chose the convention of changing the first octet of our existing IP to make it 10.x.y.z and updated our networking records accordingly.)
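If you prefer the command line to the VI client, the same change can be made with esxcfg-vswitch and esxcfg-vmknic. A minimal sketch - the port group name vMotionPrivate, the vSwitch name vSwitch0, and the host address 10.5.5.21 are placeholders rather than values from our environment:

~ # esxcfg-vswitch -A vMotionPrivate vSwitch0                       # create the new port group (placeholder name)
~ # esxcfg-vmknic -a -i 10.5.5.21 -n 255.255.255.0 vMotionPrivate   # add the vmk NIC with the private IP (placeholder address)
~ # esxcfg-vmknic -l                                                # confirm the new vmk interface is present

Either way, the new private network then shows up in the VMkernel routing table: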
~ # esxcfg-route -l
VMkernel Routes:
Network            Netmask            Gateway          Interface
169.4.5.0          255.255.255.224    Local Subnet     vmk2
10.5.5.0           255.255.255.0      Local Subnet     vmk3   <---- new vMotion private net
17.5.5.0           255.255.255.0      Local Subnet     vmk0
default            0.0.0.0            17.5.5.1         vmk0
~ #

Note: when you use the VI client to add a new vMotion port, the previous vMotion port has its vMotion bit DISABLED (since only one vMotion-enabled port is allowed).
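If you add the VMkernel NIC from the command line instead, vMotion is not enabled on it automatically; on ESXi you can check and set this with vim-cmd. A quick sketch, with vmk3 being the new interface from the listing above:

~ # vim-cmd hostsvc/vmotion/netconfig_get   # show the current vMotion network config, including the selected vmk
~ # vim-cmd hostsvc/vmotion/vnic_set vmk3   # enable vMotion on the new private interface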

We made the change on two ESXi hosts, verified that bulk vMotions no longer caused the network disruptions, then rolled the change out to the rest of our cluster nodes, including the ESX hosts, for consistency.
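When rolling this out, a quick sanity check before re-testing vMotion is to confirm each host can reach its peers over the new private addresses. For example, from one host (10.5.5.22 stands in for another host's new vMotion address and is only an example):

~ # vmkping 10.5.5.22   # should get replies across the new 10.5.5.0/24 vMotion network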

Conclusion:
This was a satisfying resolution all round: our configuration was brought into line with best practice with zero downtime, and the network disruption during vMotion was addressed at its root cause.

10 comments:

michaelrose said...

Did you move vMotion to a separate physical NIC and VLAN, or did you just create a new port group under the same NICs as your management?

VCP #20255 said...

@michaelrose - the latter; following VMware support's recommendation we merely added a new port group and checked "use..for vMotion". I got the impression from the VMware engineer that this is a well-understood issue when migrating to ESXi (whereas about a year ago it was not so common).

Stephen Dion said...

Does anyone know what actually CAUSES the problem? The fix makes sense, and I seem to understand that this occurs more with ESXi than ESX. What is actually causing this?

vExpert2011 said...

Stephen - it gets more subtle. Recently we had an ESXi host that was misconfigured and mounted its NetApp NFS datastore read-only (its IP was missing from the root section of the exports).
Whenever that host was rebooted (for upgrades), the same kind of unicast flood momentarily (shorter in duration than the vMotion issue) took out certain switch ports and dumb (ENVIROMUX) devices. It feels like a bug, but since it was due to a misconfiguration, in this case we just corrected the NetApp exports file.

Cali said...

Right now my Management and vMotion interfaces are like this (2 hosts):

Host 1
Management - 192.168.23.240
vMotion - 192.168.23.241

Host 2
Management - 192.168.23.242
vMotion - 192.168.23.243


So I should create a new vMotion network with 10.10.10.x, for instance?

vMotion host 1 : 10.10.10.5
vMotion host 2 : 10.10.10.6

Is that all I need? And then test vMotion again?

vExpert2011 said...

Yes, that is the VMware support recommendation I followed to resolve the issue.

Cali said...

I was just checking, sorry for bothering you.

Like I said, my network is 192.168.23.x/24,

so can I use 192.168.30.x, or does it have to be a different class?

Like 10.10.10.x?

Thanks a lot.

Collins5 said...

After you change the IP range of your vMotion NICs, do you need to do anything on the physical switch ports that are connected to the vmnics?

Before:
Host 1
Management - 192.168.23.240
vMotion - 192.168.23.241
Management Switch Port - VLAN 2, no trunking.
vMotion Switch Port - VLAN 2, no trunking.

Host 2
Management - 192.168.23.242
vMotion - 192.168.23.243
Management Switch Port - VLAN 2, no trunking.
vMotion Switch Port - VLAN 2, no trunking.

After:
Host 1
Management - 192.168.23.240
vMotion - 10.1.1.240
Management Switch Port - VLAN 2, no trunking.
vMotion Switch Port - VLAN ?

Host 2
Management - 192.168.23.242
vMotion - 10.1.1.243
Management Switch Port - VLAN 2, no trunking.
vMotion Switch Port - VLAN ?

Do I need to change the VLAN on the physical switch? Do I need to modify the VLAN tagging on the vSwitch?

Thank you.

Cali said...

No, you don't need to change anything.

vExpert2012 said...

@collins5

18 months ago when we implemented this solution we did not have VLAN tagging, so I cannot advise on your specific config, but from my understanding the "private" vMotion IP space should be all that is required.

I recommend you open a VMware support case if possible, to make sure this is still the best practice.

But please comment back on the results.

thanks