Monday, September 20, 2010

ESXi vMotion fails at 10%

This one could be filed under "Unexpected networking differences between ESX and ESXi"

I was testing the addition of a new ESXi 4.1 host to an existing cluster of ESX 4.1 hosts. Each time I attempted a test vMotion from the ESX hosts to the new ESXi host, the operation timed out and failed at 10% progress with this error:
"The VM Failed to resume on the destination during early power on"
I ran through some troubleshooting steps but did not realize the source of the problem until I SSH'ed to the ESXi box, attempted to touch a file on the (NetApp NFS) datastore, and received:

Permission denied (Read-only filesystem)
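
The same write test can be scripted so it is easy to repeat after each ACL change. This is only a minimal sketch in Python, not what was run in this post (that was a plain touch over SSH); the function name can_write is made up for illustration, and the datastore path mentioned afterwards is hypothetical.

```python
import os
import uuid

def can_write(directory):
    """Return True if a file can be created and then removed in `directory`.

    A read-only NFS mount, a missing path, or a permission problem all
    surface as OSError when the probe file is opened for writing.
    """
    # Unique name so the probe never collides with a real file.
    probe = os.path.join(directory, ".write-probe-" + uuid.uuid4().hex)
    try:
        with open(probe, "w"):
            pass
    except OSError:
        return False
    os.remove(probe)  # clean up the empty probe file
    return True
```

Calling can_write("/vmfs/volumes/nfs_datastore") (substitute the real mount point) should return False while the export is effectively read-only for the host, and True once the correct IP is on the ACL and the datastore has been remounted.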

But wait - I had added the VMkernel IP address to the NetApp NFS export ACL with full read-write permission, as I have done for all ESX hosts in the past.
It turned out that, unlike ESX, ESXi was using the mgmt port IP for NFS and vMotion (even though vMotion was disabled on this port):

[Screenshot: MGMT port (vMotion disabled)]

[Screenshot: VMkernel port (vMotion enabled)]


Once I added the mgmt port IP to the NetApp ACL and remounted the NFS datastore (truly read-write now), the vMotion succeeded. I'm left to determine why ESX uses the VMkernel port by default for NFS datastores and vMotion, while ESXi seems to default to the mgmt port.
At present the mgmt and VMkernel ports share the same networks, but this may not always be the case.
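
Because both ports sit on the same network, either address could plausibly show up as the source of NFS traffic. The sketch below is one way to audit that: it lists every host interface whose subnet contains the NFS server and reports any whose IP is missing from the export ACL. It does not model how ESXi actually picks a source address. The interface names, addresses, and function names are all hypothetical, and the ACL is assumed to be a plain list of host IPs.

```python
import ipaddress

def candidate_sources(interfaces, nfs_server):
    """Return (name, ip) for each interface whose subnet contains nfs_server.

    interfaces: dict mapping an interface name to an "ip/prefix" string.
    """
    target = ipaddress.ip_address(nfs_server)
    matches = []
    for name, cidr in interfaces.items():
        iface = ipaddress.ip_interface(cidr)
        if target in iface.network:
            matches.append((name, str(iface.ip)))
    return matches

def missing_from_acl(interfaces, nfs_server, acl):
    """Return candidate source interfaces whose IP is not listed in acl."""
    allowed = {ipaddress.ip_address(entry) for entry in acl}
    return [
        (name, ip)
        for name, ip in candidate_sources(interfaces, nfs_server)
        if ipaddress.ip_address(ip) not in allowed
    ]

# Hypothetical example: mgmt and VMkernel ports on the same /24,
# with only the VMkernel IP on the export ACL.
host_interfaces = {
    "vmk0 (mgmt)": "192.0.2.21/24",
    "vmk1 (VMkernel)": "192.0.2.31/24",
}
print(missing_from_acl(host_interfaces, "192.0.2.50", ["192.0.2.31"]))
# prints [('vmk0 (mgmt)', '192.0.2.21')]
```

Any interface reported here is one the storage would reject or treat as read-only if the host happened to use it, which matches the symptom described above.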
Comments welcome on this one!

2 comments:

ewen.chen said...

Hello,

I have a similar situation. I've configured my ESXi 4.1 host for dual management - one port in my local "office" network and one in the "server" network. My NFS storage is on the server network, and vMotion is enabled there. The office network has a completely different store. And I get exactly the errors you have described.
What can I do to work around this issue?

Regards,
Stanislaw Wawszczak

VCP #20255 said...

Hi Ewen/Stanislaw,

I worked a case on this with VMware.
They admitted that the lack of separate routing tables in ESXi (vs. ESX) causes this issue for customers with flat network topologies (console and VMkernel in the same routable space).

I was able to work around it by adding my console IP to my storage ACL. I am not about to re-design the network topology overnight to use ESXi - so this is acceptable for now.

Regards,
Fletcher.