Over 6 months ago, in Dec 2009, we started experiencing unexplained latency spikes on our Netapp central storage. These would cause the VMware virtual machines (VMs) to crash - typically the Linux VMs would revert to a read-only mode and the Windows VMs would not recover without a reboot. The average latency was in the 10 millisecond range, but would spike to over 1000 milliseconds (1 second) at seemingly random times, often off peak from our normal VM workloads. Netapp's Operations Manager (previously known as Data Fabric Manager (DFM)) logs the statistics, and Performance Advisor (PA) is used to review and query the data.
We would open Netapp support cases and send in perfstats for analysis, but because we could not predict the timing or reproduce the spikes on demand, we never had good data for root cause analysis. We were told to align our VMs to improve Netapp performance, but without any way to quantify the effect of misalignment, the project to align hundreds of VMs was de-prioritized - especially since the Netapp alignment tool required each VM to be down for a number of minutes proportional to the size of the VM's vmdk disk files.
In late May of 2010, the spikes started happening much more frequently (several per week), and we opened a new case, uploading perfstats which now included the spike data. According to Netapp, there were several snapmirror operations in progress at the time of the latest perfstat spike capture. This did not seem unusual, since we had 6 volumes on this aggregate of 67 x 10K RPM disks scheduled to snapmirror our VM data to another Netapp 3040 cluster 3-4 times per hour.
But it started me thinking - just how many of the disk IOPS were related to snapmirror operations? Performance Advisor was good at showing me the total IOPS for the aggregate, and IOPS for the volumes, but I wanted to map the IOPS to the function generating them (snapmirror, dedup, NFS (VMware), etc.).
I signed off the Netapp conference call that morning announcing I was going to deconstruct the total IOPS.
First, a little background on disk types and their inherent physical latency profiles.
Disks, and the aggregates built from them, will exhibit a normal increase in average latency as the IOPS level (workload) increases. According to Netapp, at a certain point the IOPS and disk_busy percentage (a Netapp Performance Advisor metric) become too high and a latency spike results - this is expected behavior according to Netapp. For our 10K RPM aggregate, this level turned out to be in the 120-150 IOPS per data drive range. Performance Advisor was showing a strangely flat, consistent level of IOPS and disk_busy %.
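To put that per-drive figure in perspective, here is a quick back-of-the-envelope calculation - a sketch only, since it assumes all 67 spindles serve data, when in reality parity and spare drives would lower the count:

```python
# Back-of-the-envelope IOPS ceiling for the aggregate.
# ASSUMPTION: all 67 drives serve data; real aggregates include parity
# and spare drives, so the true ceiling would be somewhat lower.

DATA_DRIVES = 67
IOPS_PER_DRIVE_RANGE = (120, 150)   # Netapp's level for 10K RPM disks

low, high = (DATA_DRIVES * n for n in IOPS_PER_DRIVE_RANGE)
print(f"Estimated aggregate ceiling: {low}-{high} IOPS")
# -> Estimated aggregate ceiling: 8040-10050 IOPS
```

At our observed load (more on the numbers below), we were already brushing against the low end of that range whenever internal operations stacked up.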
Since Netapp's analysis of the perfstat showed snapmirrors were busy at the time of the last spike, I proposed temporarily disabling the snapmirrors for all 6 volumes on the busy aggregate. We did so, and the results were apparent from the PA disk_busy % view:
Of the 6800 IOPS on the aggregate, 3400 of them disappeared with snapmirror disabled (~50% of the IOPS were related to snapmirror operations!).
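The methodology behind that number is nothing fancier than differencing two steady-state measurements - total aggregate IOPS before and after toggling a single feature. A minimal sketch (the helper function is mine for illustration; the 6800/3400 figures are from our Performance Advisor data):

```python
# Attribute IOPS to a feature by differencing steady-state totals
# measured before and after disabling that one feature.

def feature_iops(total_before, total_after):
    """Return (IOPS attributable to the feature, fraction of total)."""
    delta = total_before - total_after
    return delta, delta / total_before

snapmirror_iops, share = feature_iops(total_before=6800, total_after=3400)
print(f"snapmirror: {snapmirror_iops} IOPS ({share:.0%} of total)")
# -> snapmirror: 3400 IOPS (50% of total)
```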
Only about 1000 IOPS were related to the VMware NFS operations (< 15% of the 6800 total). After another conference call with Netapp to discuss the results, we decided to reschedule our VM snapmirrors from 3-4 per hour to once every 2 hours.
With this snapmirror schedule change, we saw our disk_busy % drop from 80% to 40%.
We could also document the IOPS load from dedup operations (which default to midnight) and aggr scrubs (which default to Sunday morning, 2am-7am).
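Since these internal operations run on known schedules, tagging each Performance Advisor sample with the operations active at that time makes the attribution mostly bookkeeping. A rough sketch (the function and the ~30-minute snapmirror transfer window are my assumptions, not Netapp tooling):

```python
from datetime import datetime

# Tag a Performance Advisor sample timestamp with the internal Netapp
# operations scheduled to run at that time. The schedule reflects our
# setup: snapmirror every 2 hours (post-change), dedup at midnight,
# aggr scrub Sunday 2am-7am.

def scheduled_ops(ts):
    ops = []
    if ts.hour % 2 == 0 and ts.minute < 30:      # assumed transfer window
        ops.append("snapmirror")
    if ts.hour == 0:
        ops.append("dedup")
    if ts.weekday() == 6 and 2 <= ts.hour < 7:   # Monday=0, Sunday=6
        ops.append("aggr scrub")
    return ops

# Comparing tagged vs. untagged samples gives a rough per-operation cost.
print(scheduled_ops(datetime(2010, 6, 6, 3, 15)))  # Sunday -> ['aggr scrub']
```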
Now that we realize the latency degradation profile is not slow and gradual, but rather a drastic collapse when the disks' physical limits are reached, we are much more wary of adding any additional IOPS load - not just the traditional external NFS load, but also all the internal Netapp features like snapmirror that can actually dwarf the work your clients are doing, as was the case with us.
Or, at the least, we can now quantify which aggregates are approaching their critical IOPS thresholds and migrate load to other, less busy aggregates.
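One way to do that quantification continuously is a simple watchdog built on the same per-drive numbers - a sketch, assuming you can feed it the current aggregate IOPS pulled from Operations Manager (the aggregate name, the 80% warning fraction, and the per-drive limit are all my assumptions):

```python
# Flag aggregates approaching the IOPS level where latency collapses.
# ASSUMPTIONS: 120 IOPS per data drive (the low end of what we observed
# on 10K RPM disks) and a warning threshold at 80% of the ceiling.

WARN_FRACTION = 0.8

def check_aggregate(name, current_iops, data_drives, iops_per_drive=120):
    ceiling = data_drives * iops_per_drive
    utilization = current_iops / ceiling
    if utilization >= WARN_FRACTION:
        print(f"{name}: {current_iops}/{ceiling} IOPS ({utilization:.0%}) "
              f"- consider migrating volumes to a less busy aggregate")
    else:
        print(f"{name}: {utilization:.0%} of estimated ceiling - OK")

check_aggregate("aggr_vmware", current_iops=6800, data_drives=67)
# -> aggr_vmware: 6800/8040 IOPS (85%) - consider migrating volumes ...
```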
I look forward to the additional vStorage API features expected in vSphere 4.x - including datastore latency tracking and the ability to configure QoS-type priorities so that some sets of VMs get IO resources before others.