Saturday, July 24, 2010

Error Upgrading VMware Tools

Recently we upgraded to vSphere 4.1, which as expected included a new version of VMware Tools.
Many of the Tools upgrades we initiated from vCenter were immediately reported as failed with the very non-descriptive:

Error Upgrading VMware Tools.

Turns out, if you look at the vmware.log of a VM whose Tools upgrade failed, you will see errors to the effect of: "TOOLS INSTALL Error copying upgrader binary into guest"
For Windows failures, fix this by deleting the binary left over from the previous upgrade:

del C:\WINDOWS\Temp\VMwareToolsUpgrader.exe

For Linux failures, create the destination directory:

mkdir /tmp/vmware-root

and retry the VMware Tools upgrade from vCenter - it will proceed without error.
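
If you have a lot of Linux guests to fix before retrying, a quick ssh loop can pre-create the directory everywhere. This is just a sketch, assuming you have root ssh access to the guests and a hypothetical guests.txt file listing one hostname per line:

# pre-create the Tools upgrader staging directory on each Linux guest (guests.txt is hypothetical)
while read guest; do
    ssh root@"$guest" "mkdir -p /tmp/vmware-root"
done < guests.txt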

Hopefully VMware fixes this with a patch of some sort soon.

Friday, July 16, 2010

Quantifying VMDK Misalignment

I admit it, I'm a skeptic - but when there are many items competing for a limited number of waking hours, I've found a healthy measure of skepticism serves one's sanity well!
So when a vendor recommends a change or action I ask a few questions:
1) why - what are the technical reasons behind this recommendation?
2) how can I measure the current state and the effect of your recommendation?

Such was the case with Netapp's long-standing recommendation to align VMs.
Until very recently the answers were:
1) it's less than optimal now with unaligned VMs
2) you can't measure it with customer-accessible tools - just trust us, unaligned VMs create inefficiencies and could be causing your outages!

These vague and less than transparent answers, combined with the fact that Netapp's VM alignment tools require the VMs to be powered off for hours (depending on the size of the VM), meant the alignment recommendation was de-prioritized until we could
a) get clear answers
b) measure the before and after states

Last month we finally received both a technically sufficient explanation and a method to measure unaligned IO (via an advanced, privileged stat), which let us re-prioritize the alignment project and organize VM downtime with all customers to get it done.

On the why: after several outages in 2 weeks we escalated to Netapp support and finally had a Netapp engineer assigned who could explain, from our perfstat captures, the effect of the misaligned IO - how it was tipping ONTAP from normal mode into a severely degraded (latency-spike-inducing, VM-killing) mode - and I'm paraphrasing here ;)

On the quantification:
It turns out one of the only ways to currently measure the effect of unaligned IO on your Netapp is via the pw.over_limit counter. It's not a standard counter (e.g. it isn't available via SNMP).
It's available only in "priv set advanced" command line mode - so a little effort was needed to get this counter tracked over time in our cacti trending system.

A cronjob kicks off a script every 5 minutes that sshes into the Netapp head and logs the pw.over_limit counter to a file for cacti to pick up and graph:

ssh netapp " priv set advanced ; wafl_susp -w " |& egrep pw.over | awk '{print $NF}'
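
For reference, the wrapper that the cron entry calls might look something like this - a minimal sketch, assuming passwordless ssh keys to the filer and a hypothetical script name and log path:

#!/bin/bash
# log_pw_over_limit.sh (hypothetical name) - append a timestamped pw.over_limit sample for cacti
LOG=/var/log/netapp/pw_over_limit.log
VALUE=$(ssh netapp "priv set advanced; wafl_susp -w" 2>&1 | egrep pw.over | awk '{print $NF}')
echo "$(date +%s) $VALUE" >> "$LOG"

and the crontab entry:

*/5 * * * * /usr/local/bin/log_pw_over_limit.sh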

Cacti was then configured to graph the counter deltas every 5 minutes:


We immediately noticed time-of-day patterns of spikes in this stat and were able to use that information to isolate the sources of misaligned IO (e.g. daily tripwire reports kicking off at 4am). We had had a couple of latency spike outages in the 4-5am timeframe which, before we tracked this stat, were a total mystery, because the regular IOPS stats from Performance Advisor showed things relatively idle during that window.

Alignment status: 2 weeks ago we had 110 VMs unaligned. Today it's 34 and dropping daily, thanks to the increased priority that came with a fuller understanding of VM (mis)alignment, its effects, and how to quantify those effects over time.

edit 12/1/10:
Exported Cacti XML template for those interested

Thursday, July 15, 2010

vSphere 4.1 upgrade gotchas: ssh, vCenter changes

With the release of vSphere 4.1 this week, we upgraded the lab cluster to check out the new features - especially the vStorage API stats like latency & IOPS.
After instantiating a new vCenter VM (64-bit Windows 2008, since 4.1 now requires 64-bit), I used the new 4.1 vCenter, loaded with the 4.1 upgrade zip file (upgrade-from-ESX4.0-to-4.1.0-0.0.260247-release.zip), to upgrade the hosts. Each of the Dell 1950 nodes in the lab cluster completed its upgrade and reboot in under 15 minutes.
But what we found was that we could no longer ssh into the 4.1 nodes with our user accounts.
In /var/log/messages we saw (because we could still get in as root on the console):
error: PAM: Permission denied for useracct from sourceIP

The /etc/passwd accounts were intact, and we could su - useracct - so what changed?

Turns out, at the bottom of page 65 of the vsp_41_upgrade_guide.pdf, user accounts now need to be listed as root group members to be allowed ssh access:

NOTE After upgrading to ESX 4.1, only the Administrator user has access to the service console. To grant service console access to other users after the upgrade, consider granting the Administrator permissions to other users.

So editing the /etc/group file and adding all the users we had in the wheel group (for sudo access) to the root group fixed the issue immediately.
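
For illustration, the change is just appending the affected usernames to the root group line in /etc/group on the service console (jsmith and mjones are hypothetical accounts):

# /etc/group - before
root:x:0:
# /etc/group - after
root:x:0:jsmith,mjones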

If VMware's intent was to get the attention of ESX users "Hey 4.1 is the last ESX version - get migrating to ESXi!" - mission accomplished :)

Other than those minor hiccups (64 bit required for vCenter, ssh breaking) we are impressed with all the new features and performance improvements VMware has packed into a "minor" (4.0->4.1) release.

Tuesday, July 6, 2010

VMware and Netapp Storage: deconstructing IOPS

Over 6 months ago, in Dec 2009, we started experiencing unexplained latency spikes on our Netapp central storage. These would cause VMware virtual machines (VMs) to crash - typically the Linux VMs would remount their filesystems read-only and the Windows VMs would not recover without a reboot. The average latency was in the 10 millisecond range, but would spike to over 1000 milliseconds (1 second) at seemingly random times, off-peak from our normal VM workloads. Netapp's Operations Manager (previously known as Data Fabric Manager (DFM)) logs the statistics, and Performance Advisor (PA) is used to review/query the data.


We would open Netapp support cases and send in perfstats for analysis, but because we could not predict the timing or reproduce the spikes on demand, we never had good data for root cause analysis. We were told to align our VMs to improve Netapp performance, but without any way to quantify the effect of misalignment, the project to align hundreds of VMs was de-prioritized - especially since the Netapp alignment tool required each VM to be down for a length of time proportional to the size of its vmdk disk files.
In late May of 2010, the spikes started happening much more frequently (several per week) and we opened a new case, uploading perfstats which now included the spike data. According to Netapp, there were several snapmirror operations in progress at the time of the latest perfstat spike capture. This did not seem unusual, since we had 6 volumes on this aggregate of 67 x 10K RPM disks scheduled to snapmirror our VM data to another Netapp 3040 cluster 3-4 times per hour.
But it started me thinking - how much of the disk IOPS are related to snapmirror operations? Performance Advisor was good at showing me the total IOPS for the aggregate and the IOPS for the volumes, but I wanted to map the IOPS to the function generating them (snapmirror, dedup, NFS (VMware), etc.)
I signed off the Netapp conference call that morning announcing I was going to deconstruct the total IOPS.
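
As a starting point, you can sample the aggregate and per-volume ops counters side by side from the filer command line - a rough sketch, assuming ssh access to the filer and that these 7-mode "stats" counters (volume total_ops, aggregate total_transfers) exist on your ONTAP release; aggr1/vmvol1/vmvol2 are hypothetical names:

# one 60-second sample of aggregate ops vs per-volume ops
ssh netapp "stats show -i 60 -n 1 aggregate:aggr1:total_transfers volume:vmvol1:total_ops volume:vmvol2:total_ops"
# compare the disk-level numbers with the front-end NFS op rates in Performance Advisor to gauge how much is internal work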
First a little background on Disk types and their inherent physical latency profiles.


Disks, and the disk aggregates they comprise, will exhibit a normal increase in average latency as the IOPS level (workload) increases. According to Netapp, at a certain point the IOPS and disk_busy percentage (a Netapp Performance Advisor metric) become too high and a latency spike results - this is expected behavior. For our 10K RPM aggregate this level turned out to be in the range of 120-150 IOPS per data drive. Meanwhile, Performance Advisor was showing a strange, flat, consistent level of IOPS and disk_busy %:
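
A quick back-of-envelope check shows why that matters for this aggregate (the data/parity split below is my assumption - some of the 67 disks are parity/spares):

~60 data drives x 120 IOPS/drive = ~7200 IOPS (low end of the collapse zone)
~60 data drives x 150 IOPS/drive = ~9000 IOPS (high end)

so an aggregate pushing ~6800 IOPS (as ours was, see below) is sitting just under that range, with very little headroom before latency collapses.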



Since Netapp's analysis of the perfstat showed snapmirrors were busy at the time of the last spike, I proposed temporarily disabling snapmirror for all 6 volumes on the busy aggregate. We did so, and the results were immediately apparent from the PA disk_busy % view:


Of the 6800 IOPS on the aggregate, 3400 disappeared with snapmirror disabled (~50% of the IOPS were related to snapmirror operations!)
Only about 1000 IOPS were related to the VMware NFS operations (< 15% of the 6800 total). After another conference call with Netapp to discuss the results, we decided we needed to reschedule our VM snapmirrors from 3-4 per hour to once every 2 hours.
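
The schedule change itself is just an edit to /etc/snapmirror.conf on the destination filer - a hypothetical example with made-up filer/volume names, using the cron-style minute/hour/day-of-month/day-of-week fields:

# before: four transfers per hour
# filer1:vmvol  filer2:vmvol_mirror  -  0,15,30,45 * * *
# after: once every 2 hours
filer1:vmvol  filer2:vmvol_mirror  -  0 0,2,4,6,8,10,12,14,16,18,20,22 * *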
With this snapmirror modification we saw our disk busy % drop from 80% to 40%.
We could also document the IOPS load from dedup operations (they default to midnight) and aggr scrubs (default Sunday morning 2am-7am).
Now that we realize the latency degradation profile is not slow and gradual, but a drastic collapse once the disks' physical limits are reached, we are much more wary of adding any additional IOPS load - not just the traditional external NFS load, but also all the internal Netapp features like snapmirror, which can actually dwarf the work your clients are doing (as was the case for us).
Or, at the very least, we can quantify which aggregates are approaching their critical IOPS thresholds and migrate load to other, less busy aggregates.

I look forward to the additional vStorage API features expected in vSphere 4.x - including datastore latency tracking and the ability to configure QoS-type priorities so some sets of VMs get IO resources before others.