Sunday, January 29, 2012

3 VSWP Relocation Methods

As usual this post is born of a real situation encountered in our VMware infrastructure.
We were trialing a new storage array with SSD and decided to configure the swap files for the test VMs to be located on the SSD-backed array. Great - if they actually end up swapping (which should never happen, but let's try out this vSphere 5 feature!), they'll get the benefit of SSD performance.

Fast forward to the end of the SSD array trial three months later, when it's time to unmount the (NFS) datastore, and vCenter tells us (paraphrasing) "error - open file handles, unmount failed".

At first I ssh'ed into the ESXi host and ran 'find /vmfs/volumes/nfsdatastore -mtime -1 -print', and spotted the HA file in the output - so I figured, OK, we can disable HA temporarily to get this unmounted. We disabled HA on the cluster, repeated the unmount, and still hit the "open files" error. Some googling turned up the ESXi command line unmount directives, which also failed but pointed me at the vmkernel logs - which not so helpfully repeated the "files open" error without revealing WHICH files were open (vexing, since the find command was telling me no files on the datastore had been modified in the last day).
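For anyone following along at the command line, the attempts looked roughly like this (the datastore name is an example - substitute your own volume name):

    # anything modified on the datastore in the last day?
    find /vmfs/volumes/nfsdatastore -mtime -1 -print

    # list the NFS mounts known to this host, then try the unmount from the ESXi shell
    esxcli storage nfs list
    esxcli storage nfs remove -v nfsdatastore

    # the failure shows up here, but only as a generic "busy / open files" message
    tail -n 20 /var/log/vmkernel.log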

Finally a long listing of the NFS datastore revealed the issue - there were .vswp (swap) files on the datastore, and although they were not active (find -mtime -1 did not flag them), the file handles were held open by the 5 running VMs on the ESXi 5 host.
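Something along these lines is all it takes to spot them (the VM names and sizes below are made up for illustration):

    ls -lah /vmfs/volumes/nfsdatastore/*/ | grep vswp
    # -rw-------   1 root  root   4.0G  Jan 12 09:14  testvm01-a1b2c3d4.vswp
    # -rw-------   1 root  root   8.0G  Jan 12 09:14  testvm02-e5f6a7b8.vswp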

So what to do? Well, first reconfigure the cluster swapfile location property back to "locate swap with VM files" - then how to relocate the active .vswp files? My VCP5 studying told me a Storage vMotion (METHOD #1 - least disruptive) would do the trick - and this was feasible and worked great for the small VMs (< 100 GB).
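If you want to double-check that the swap file really followed the VM after the Storage vMotion, a quick look from the ESXi shell does the job (the datastore path and VM name here are examples):

    # the .vswp should now sit next to the VM's other files on the destination datastore
    ls -lah /vmfs/volumes/proddatastore/smallvm/ | grep vswp

    # the running VM's .vmx should also carry a sched.swap.derivedName entry pointing at it
    grep -i sched.swap /vmfs/volumes/proddatastore/smallvm/smallvm.vmx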
But I had a pre-prod app server of 150 GB - what if we just suspend the VM (METHOD #2 - VM down during suspension)? Well, it worked: the VM was suspended in under 30 seconds and the .vswp file disappeared from the trial datastore; once it was powered back on, the .vswp was recreated alongside the VM's files.
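If you'd rather not click through vCenter, the suspend/resume can also be driven with vim-cmd from the ESXi shell - a rough sketch, where 42 stands in for whatever VM id getallvms reports for your VM:

    vim-cmd vmsvc/getallvms          # find the VM id
    vim-cmd vmsvc/power.suspend 42   # suspend - the .vswp on the old datastore goes away
    vim-cmd vmsvc/power.on 42        # resume - a new .vswp is created per the current swap policy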
But the last VM was a huge 2 TB dev Oracle DB with a ton of IO going on. Rather than initiate a Storage vMotion (which would take hours even on a 10G network), we consulted the DBA, who was planning Oracle patching that evening anyway. So we powered down the VM (METHOD #3 - VM power cycle) after it was patched, and I powered it back on expecting the .vswp file to be relocated - but it was still on the trial datastore! At this point I decided to power down the VM again, and with it off I was able to successfully unmount the trial datastore. The configuration dictated the .vswp should be with the VM, so either this would work or I'd be opening a P1 case with VMware ;) - it worked, and the 2 TB dev Oracle VM powered up fine.
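For completeness, the power cycle can be scripted the same way (again, 42 stands in for the VM id, and power.shutdown needs VMware Tools in the guest):

    vim-cmd vmsvc/power.shutdown 42   # clean guest OS shutdown
    vim-cmd vmsvc/power.getstate 42   # confirm it is powered off
    vim-cmd vmsvc/power.on 42         # power back on - in theory the .vswp follows the current policy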

I suspect it's a bug in the ESXi 5 logic that the VM's .vswp is not relocated according to the cluster config after a power cycle - but by unmounting the datastore, the end goal of freeing up the trial array for return was accomplished. Has anyone else seen this swap file location behavior, or have a more elegant method for non-disruptively relocating .vswp files?

2 comments:

Pankaj said...

Hi,

I have also faced a similar issue; to resolve it I moved the VMs to another datastore by Storage vMotion, as the VM size was not more than 70GB....
but thanks for these detailed steps.

StockManiac2008 said...

Quick question - this new SSD-based NFS storage you are evaluating - is that Tintri?