Sunday, January 29, 2012

3 VSWP Relocation Methods

As usual this post is born of a real situation encountered in our VMware infrastructure.
We were trialing a new storage array with SSD and decided to configure the test VMs' swap files to live on the SSD-backed array. Great, if they actually end up swapping (it should never happen, but let's try out this vSphere 5 feature!), they'll get the benefit of SSD performance.

Fast forward to the end of the SSD array trial three months later, when it's time to unmount the (NFS) datastore and vCenter tells us (paraphrasing) "error - open file handles, unmount failed".

At first I SSH'ed into the ESXi host, ran 'find /vmfs/volumes/nfsdatastore -mtime -1 -print', and spotted the HA file in the output - so I figured, OK, we can disable HA temporarily to get this unmounted. We disabled HA on the cluster, repeated the unmount, and still hit the "open files" error. Some Googling turned up the ESXi command-line unmount commands, which also failed, but pointed me at the vmkernel logs - which not so helpfully repeated the "files open" error without revealing WHICH files were open (vexing, since the find command told me no files on the datastore had been modified in the last day).
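For reference, the command-line attempt from the ESXi 5 shell looked roughly like this (the volume name "nfsdatastore" is a placeholder for our trial datastore, and your esxcli output will vary by build):

```shell
# List NFS mounts known to this host
esxcli storage nfs list

# Attempt the unmount from the command line; for us this failed
# with the same open-files error the GUI reported
esxcli storage nfs remove --volume-name=nfsdatastore

# The vmkernel log repeats the error but doesn't name the open files
tail /var/log/vmkernel.log
```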

Finally a long listing of the NFS datastore revealed the issue: there were .vswp (swap) files on the datastore, and although they were not active (find -mtime -1 did not flag them), their file handles were held open by the 5 running VMs on the ESXi 5 host.
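As a quick local illustration of why the earlier find missed them: '-mtime -1' only matches files modified within the last 24 hours, and an idle .vswp can hold an open handle without ever being written to. A throwaway sketch (file names are made up):

```shell
# Create a scratch directory with one fresh file and one stale file
tmpdir=$(mktemp -d)
touch "$tmpdir/recent.vmdk"                # modified just now
touch -d "2 days ago" "$tmpdir/idle.vswp"  # stale mtime, but a handle could still be open

# -mtime -1 matches the fresh file (and the directory itself), not the stale one
find "$tmpdir" -mtime -1 -print

rm -rf "$tmpdir"
```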

So what to do? First, reconfigure the cluster swapfile location property back to "locate swap with VM files" - then how to relocate the active .vswp files? My VCP5 studying told me a Storage vMotion (METHOD #1 - least disruptive) would do the trick - and this was feasible and worked great for the small VMs (< 100 GB).
But I had a pre-prod app server of 150 GB - what if we just suspend the VM (METHOD #2 - VM down during suspension)? Well, it worked: the VM was suspended in under 30 seconds and the .vswp file disappeared from the trial datastore; once powered back on, the .vswp was recreated alongside the VM's files.
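If you'd rather suspend from the ESXi shell than the vSphere client, vim-cmd can do it (the Vmid 42 below is hypothetical - look yours up with the first command):

```shell
# Find the Vmid of the target VM
vim-cmd vmsvc/getallvms

# Suspend releases the .vswp handle; power-on recreates the swap file
# at the currently configured location (42 is a placeholder Vmid)
vim-cmd vmsvc/power.suspend 42
vim-cmd vmsvc/power.on 42
```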
But the last VM was a huge 2 TB dev Oracle DB with a ton of I/O going on. Rather than initiate a Storage vMotion (which would take hours even on a 10 GbE network), we consulted the DBA, who was planning Oracle patching that evening anyway. So we powered down the VM (METHOD #3 - VM power cycle) after it was patched, and I powered it back on expecting the .vswp file to be relocated - but it was still on the trial datastore! At that point I powered the VM down again and, with it off, was able to successfully unmount the trial datastore. The configuration dictated the .vswp should be created with the VM, so either this would work or I'd be opening a P1 case with VMware ;) - it worked, and the 2 TB dev Oracle VM powered up fine.
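To verify where the active swap file actually lands after a power cycle, a long listing on each datastore does the trick (the datastore and VM directory names here are placeholders):

```shell
# Should come back empty once nothing swaps on the trial datastore
ls -lh /vmfs/volumes/nfsdatastore/*.vswp

# The .vswp should now sit alongside the VM's other files
ls -lh /vmfs/volumes/vmdatastore/oracledb/*.vswp
```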

I suspect it's a bug in the ESXi 5 logic that the VM's .vswp is not relocated according to the cluster config after a power cycle - but by unmounting the datastore, the end goal of freeing up the trial array for return was accomplished. Anyone else seen this swap file location behavior, or have a more elegant method for non-disruptively relocating .vswp files?

Friday, January 20, 2012

Passed VCP5 exam

This morning I passed my VCP5 exam!
I had been devoting an hour or two a day to study for the 3 weeks leading up to today.
In my opinion the VCP5 definitely puts less emphasis on rote memorization of maximums and more on the admin's hands-on experience with vSphere 5 - an improvement over the VCP4 exam!

While I was studying for the VCP5 I also registered as a Solution Provider and leveraged the VCP facts to take the VTSP module tests to obtain VMware Technical Sales Professional credentials.
Now I am taking the VSP (VMware Sales Professional) modules online and finding the vCloud modules especially useful in coalescing my strategy for selling new Cloud projects internally.

vMotion fails at 9% - Operation Timed Out

I recently added a new ESXi 5.0.0 host to one of our clusters and attempted to vMotion a test VM onto the new host, but it failed at 9% with "Operation Timed Out".

Turns out the datastore for the VM was mounted read-only, since the new host's IP had not been added to the NetApp export ACL allowing full read/write access.
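A quick way to catch this from the new host's shell: check the Read-Only flag in the NFS mount list and try a throwaway write (the datastore name is a placeholder):

```shell
# Read-Only should be false for the datastore in question
esxcli storage nfs list

# Spot-check write access; this fails on a read-only mount
touch /vmfs/volumes/nfsdatastore/.rwtest && rm /vmfs/volumes/nfsdatastore/.rwtest
```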

Once the IP was added and the NFS datastore re-mounted read/write, the vMotions zipped through at 10 GbE speeds!