Tuesday, July 6, 2010

VMware and Netapp Storage: deconstructing IOPS

Over 6 months ago, in Dec 2009, we started experiencing unexplained latency spikes on our Netapp central storage. These would cause the VMware virtual machines (VMs) to crash - typically the linux VMs would drop to a read-only filesystem and the windows VMs would not recover without a reboot. The average latency was in the 10 millisecond range, but would spike over 1000 milliseconds (1 second) at seemingly random times, often off-peak from our normal VM workloads. Netapp's Operations Manager (previously known as Data Fabric Manager (DFM)) logs the statistics, and Performance Advisor (PA) is used to review/query the data.


We would open Netapp support cases and send in perfstats for analysis, but because we could not predict the timing or reproduce the spikes on demand, we never had good data for root cause analysis. We were told to align our VMs to improve Netapp performance, but without any way to quantify the effect of misalignment, the project to align hundreds of VMs was de-prioritized - especially since the Netapp alignment tool required each VM to be down for a number of minutes proportional to the size of the VM's vmdk disk files.
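(As an aside, just finding out whether a given Linux VM is misaligned is cheap. A rough sketch below, assuming the guest exposes the usual /sys/block/*/*/start files - this is not Netapp's mbrscan/mbralign tooling, only a sanity check.)

    # check_alignment.py - rough sketch, not Netapp's mbrscan/mbralign.
    # Assumes a Linux guest where /sys/block/<disk>/<partition>/start reports
    # the partition's starting offset in 512-byte sectors.
    import glob, os

    SECTOR = 512        # sysfs "start" values are in 512-byte sectors
    WAFL_BLOCK = 4096   # Netapp WAFL block size

    for start_file in glob.glob("/sys/block/sd*/sd*[0-9]/start"):
        part = os.path.basename(os.path.dirname(start_file))
        with open(start_file) as f:
            offset_bytes = int(f.read().strip()) * SECTOR
        status = "aligned" if offset_bytes % WAFL_BLOCK == 0 else "MISALIGNED"
        print("%s starts at byte %d -> %s" % (part, offset_bytes, status))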
In late May of 2010 the spikes started happening much more frequently (several per week), and we opened a new case, uploading perfstats which now included the spike data. According to Netapp, there were several snapmirror operations in progress at the time of the latest perfstat spike capture. This did not seem unusual, since we had 6 volumes on this aggregate of 67 x 10K RPM disks scheduled to snapmirror our VM data to another Netapp 3040 cluster 3-4 times per hour.
But it started me thinking - how much of the disk IOPS is related to snapmirror operations? Performance Advisor was good at showing me the total IOPS for the aggregate, and IOPS for the volumes, but I wanted to map the IOPS to the function (snapmirror, dedup, NFS (VMware), etc).
I signed off the Netapp conference call that morning announcing I was going to deconstruct the total IOPS.
First, a little background on disk types and their inherent physical latency profiles.


Disks, and the aggregates built from them, will exhibit a normal increase in average latency as the IOPS level (workload) increases. According to Netapp, at a certain point the IOPS and disk_busy percentage (a Netapp Performance Advisor metric) become too high and a latency spike results - this is expected behavior according to Netapp. For our 10K RPM aggregate this level turned out to be around 120-150 IOPS per data drive. Performance Advisor was showing a strangely flat, consistent level of IOPS and disk_busy %.
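To put that per-drive figure in context for our aggregate, the back-of-the-envelope ceiling works out roughly as follows (the parity-disk count below is an assumption for illustration, not our exact RAID group layout):

    # Back-of-the-envelope IOPS ceiling for the aggregate, using the
    # 120-150 IOPS per 10K RPM data drive figure above. The parity-disk
    # count is an assumption for illustration - substitute your own layout.
    total_disks  = 67
    parity_disks = 8                      # e.g. 4 RAID-DP groups x 2 parity disks (assumed)
    data_disks   = total_disks - parity_disks

    low, high = 120, 150                  # IOPS per data drive before latency collapses
    print("rough aggregate ceiling: %d - %d IOPS" % (data_disks * low, data_disks * high))
    # -> 7080 - 8850 IOPS under these assumptions; our observed 6800 total
    #    was already in that neighborhood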



Since Netapp's analysis of the perfstat showed snapmirrors were busy at the time of the last spike, I proposed temporarily disabling snapmirror for all 6 volumes on the busy aggregate. We did so, and the results were apparent from the PA disk_busy % view:


Of the 6800 IOPS on the aggregate, 3400 disappeared with snapmirror disabled (~50% of the IOPS were related to snapmirror operations!).
Only about 1000 IOPS were related to the VMware NFS operations (< 15% of the 6800 total). After another conference call with Netapp to discuss the results, we decided to reschedule our VM snapmirrors from 3-4 per hour to once every 2 hours.
With this snapmirror modification we saw our disk busy % drop from 80% to 40%.
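The "deconstruction" itself was nothing fancier than differencing the aggregate counters before and after toggling one feature at a time. A minimal sketch of that bookkeeping (the figures are the measurements above; in practice they come out of Performance Advisor / perfstat, not this script):

    # Deconstruct aggregate IOPS by difference: measure the total, disable one
    # internal feature (snapmirror here), measure again, and credit the delta
    # to that feature. Figures below are the measurements described above.
    samples = {
        "everything enabled":  6800,   # total aggregate IOPS from Performance Advisor
        "snapmirror disabled": 3400,   # total after snapmirror was turned off
    }

    total      = samples["everything enabled"]
    snapmirror = total - samples["snapmirror disabled"]
    nfs_vmware = 1000                  # VMware NFS ops, read from the volume/NFS counters
    other      = total - snapmirror - nfs_vmware

    for name, iops in (("snapmirror", snapmirror), ("VMware NFS", nfs_vmware), ("other", other)):
        print("%-11s %5d IOPS (%4.1f%% of total)" % (name, iops, 100.0 * iops / total))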
We could also document the IOPS load from dedup operations (they default to midnight) and aggr scrubs (default Sunday morning 2am-7am).
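The same before/after trick works for those, but since they run on fixed schedules you can also just credit samples by time window. A small sketch, assuming the defaults mentioned above (dedup at midnight, aggr scrub Sunday 02:00-07:00):

    # Credit background IOPS by schedule window, using the defaults noted
    # above (dedup at midnight, aggr scrub Sunday 02:00-07:00).
    from datetime import datetime

    def background_job(ts):
        """Return which scheduled background job, if any, a sample falls under."""
        if ts.weekday() == 6 and 2 <= ts.hour < 7:   # Sunday 02:00-07:00
            return "aggr scrub"
        if ts.hour == 0:                             # the midnight hour
            return "dedup"
        return "client workload / other"

    print(background_job(datetime(2010, 7, 4, 3, 30)))   # Sunday 03:30 -> aggr scrub
    print(background_job(datetime(2010, 7, 6, 0, 15)))   # 00:15 -> dedup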
Now that we realize the latency degradation profile is not slow and gradual, but a drastic collapse when the disks' physical limits are reached, we are much more wary of adding any additional IOPS load - not just the traditional external NFS load, but also all the internal Netapp features like snapmirror, which can actually dwarf the work your clients are doing (as was the case with us).
At the very least we can now quantify which aggregates are approaching their critical IOPS thresholds and migrate load to other, less busy aggregates.

I look forward to the additional vStorage API features expected in vSphere 4.x - including datastore latency tracking and the ability to configure QoS-style priorities so that some sets of VMs get IO resources ahead of others.

9 comments:

StewDaPew said...

-- Disclaimer, NetApp Employee --

Fletcher,

I'm sorry to read that you have run into a condition which you and your team had to work to resolve.

From what you shared, it appears that your NetApp array was underperforming due to workload requirements.

In your testing you reduced the workload by disabling SnapMirror replication and performance returned to an acceptable level.

It is the unanimous opinion of my team that the root cause of your issue is the misalignment of the GOS partitions in the VMs.

By aligning your VMs you will be reducing the IO requirements of your workload, thus allowing the array to scale further. I would recommend that you plan to align your VMs, as doing so will defer future hardware purchases that may be required to meet the workload of a non-aligned environment.
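To illustrate the IO amplification with back-of-the-envelope math (an illustration only, not official sizing figures): a 4 KB guest block whose partition starts off a 4 KB boundary straddles two WAFL blocks, so the array does roughly double the back-end work for that IO.

    # Why misalignment inflates back-end IO: a 4 KB guest block whose partition
    # does not start on a 4 KB boundary straddles two WAFL blocks. Illustration only.
    SECTOR, WAFL = 512, 4096

    def wafl_blocks_touched(partition_start_sector, guest_offset, io_size=4096):
        start = partition_start_sector * SECTOR + guest_offset
        end = start + io_size - 1
        return end // WAFL - start // WAFL + 1

    print(wafl_blocks_touched(63, 0))   # classic MBR default (sector 63) -> 2 blocks
    print(wafl_blocks_touched(64, 0))   # aligned start (sector 64)       -> 1 block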

NetApp provides a free tool to identify and align your VMs called MBRTools. There are other tools available from NetApp partners like Vizioncore. If you are interested in reducing the downtime required by the VM to run MBRAlign, I'd suggest you install MBRTools into a linux VM. While running MBRAlign from a VM is not officially supported today, we have had many customers report that it worked rather well and performance gains have been reported as high as 900%.

If you are in need of support from within the virtualization industry around the requirement to align GOS partitions, you can refer to this post of mine:

http://blogs.netapp.com/virtualstorageguy/2009/06/io-efficiency-and-alignment-the-cloud-demands-standards.html

This information also appears in multiple documents from NetApp and our partners.

http://media.netapp.com/documents/tr-3747.pdf
http://media.netapp.com/documents/tr-3749.pdf
http://media.netapp.com/documents/tr-3428.pdf

Thanks for the post, please let me know if you need additional support.

Vaughn
http://blogs.netapp.com/virtualstorageguy/

StewDaPew said...

Fletcher - It would be great if you provided a follow up post once you have aligned your VMs. Areas of interest might include metrics around FAS system utilization gains, VM performance gains, and gains in the reduction in amount of data replicated and replication times, etc.

VCP #20255 said...

Vaughn, I've opened a couple of Netapp RFE-type cases to request CPU and IOPS categorization and tracking in Performance Advisor - currently, answering questions like these is non-trivial:
1) How many ONTAP CPU/thread resources are going to snapmirror, snapshot, NFS, dedup, etc?
2) How many aggregate IOPS are going to snapmirror, snapshot, NFS, dedup, etc?

Errol said...

-- Disclaimer, NetApp Employee --

Fletcher,

A word about the "Disk IOPS vs. Response Time" graph. Those curves are for 4K random reads only. If you are doing sequential I/O, or larger IO sizes, the results would be completely different.

kaiseruncc said...

We have also recently been seeing latency spikes across all our filers. We currently have an environment of 100+ hosts and 1100+ VMs, with approximately ten 3000-series filers. SnapMirror is widely used across our environment.

We have looked at VM alignment and only 7% of our VMs are unaligned, yet the latency spikes continue to be an issue across the entire environment. I agree that alignment is important, but it's becoming tiring to hear that constantly from NetApp when it's such a small subset of the environment I work in. I agree it would be interesting to see another study done after the VMs in Fletcher's environment were aligned. Thanks for the research already done.

vExpert2011 said...

@kaiseruncc - did you try disabling snapmirror for a short time while observing the IOPS drop in NMC?
For us, this made it clear we needed to scale back the frequency and stagger the start times of the snapmirror schedule.
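The staggering itself is just arithmetic - a rough sketch below with hypothetical volume names; the real schedule lives in /etc/snapmirror.conf on the filer:

    # Stagger the 6 volumes' snapmirror starts evenly across the 2-hour
    # interval so they don't all hit the aggregate at once. Volume names are
    # hypothetical; the actual schedule goes in /etc/snapmirror.conf.
    volumes  = ["vmvol1", "vmvol2", "vmvol3", "vmvol4", "vmvol5", "vmvol6"]
    interval = 120                            # minutes between transfers per volume

    for i, vol in enumerate(volumes):
        offset = i * interval // len(volumes)     # 0, 20, 40, 60, 80, 100
        print("%s: transfer at +%d min past each 2-hour mark" % (vol, offset))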

jo_strasser said...

Hi Fletcher!

Thank you for your cool thread.

I got the same issue in my environment.

We are using a FAS6240 MetroCluster with 420 spindles and don't use SnapMirror - only deduplication.

But every 5-10 seconds we get a latency spike of 500 to 3800 ms!

The CPU and Disk loads of the Filer are normal (about 30%).

About 95% of our VMs are aligned, but we cannot see any difference.

Have you already solved your issues?

Can you tell me more about the problems?

Thanks, Jo!

vExpert2012 said...

Hi Jo -
Trying to determine the source of IO spikes on shared storage is not an easy problem. Here are a few ideas:

1) Is there a pattern (time of day)? Could a virus scan/tripwire job be kicking off and causing the load? (See the sketch after this list.)
2) Are you on the latest stable Ontap? There could be a bug you are hitting.
3) Try downloading a free trial of vCops and check out the IO-per-VM heatmap to see whether some "bully" VMs are hogging your aggr's IOPS:
http://www.vmware.com/products/datacenter-virtualization/vcenter-operations-management/overview.html
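For idea 1, the quick grouping I have in mind looks roughly like this (the timestamps are made up; in practice they would come from a Performance Advisor or vCops export):

    # Quick check for a time-of-day pattern: bucket spike timestamps by hour
    # and count them. Timestamps here are made up - in practice they would
    # come from a Performance Advisor or vCops export.
    from collections import Counter
    from datetime import datetime

    spikes = [
        datetime(2012, 9, 3, 0, 5), datetime(2012, 9, 4, 0, 12),
        datetime(2012, 9, 5, 0, 7), datetime(2012, 9, 5, 14, 31),
    ]
    for hour, count in sorted(Counter(ts.hour for ts in spikes).items()):
        print("%02d:00 - %d spikes" % (hour, count))
    # A cluster at one hour (e.g. midnight) points at a scheduled job such as
    # dedup, a virus scan, or backups.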

good luck!

jo_strasser said...

Hi!

I have tried all of this.

We are using the latest Ontap version (8.1.1GA).

We are using VCOps to analyze our systems.

But I can only see that the issue occurs on both MetroCluster sites almost simultaneously, with a time difference of a few seconds.

This could maybe come from the replication (MetroCluster).

The spindle load is about 30% on average, which I can't understand.

We have also checked the fibrechannel connections and cannot find any problems.

It looks like a massive storage problem.

How have you solved your issues?

Thanks, Jo!