Friday, July 16, 2010

Quantifying VMDK Misalignment

I admit it, I'm a skeptic - but when there are many items competing for a limited number of waking hours, I've found a healthy measure of skepticism serves one's sanity well!
So when a vendor recommends a change or action, I ask a few questions:
1) why - what are the technical reasons behind this recommendation?
2) how can I measure the current state and the effect of your recommendation?

Such was the case with NetApp's long-standing recommendation to align VMs.
Until very recently the answers were:
1) it's less than optimal now with unaligned VMs
2) you can't measure it with customer-facing tools - just trust us, unaligned VMs create inefficiencies and they could be causing your outages!

These vague and less than transparent answers, combined with the fact that NetApp's VM alignment tools require the VMs to be off for hours (depending on the size of the VM), meant the alignment recommendation was de-prioritized until we could
a) get clear answers
b) measure the before and after states

Last month we finally received both a technically sufficient explanation and a method to measure unaligned IO (via an advanced-privilege stat), which let us re-prioritize the alignment project and organize VM downtime with all customers to get it done.

On the why: after several outages in 2 weeks we escalated to NetApp support and finally had a NetApp engineer assigned who could explain, from our perfstat captures, the effect of the misaligned IO - how it was tipping ONTAP from normal mode into a severely degraded (latency-spike-inducing, VM-killing) mode (and I'm paraphrasing here ;)

On the quantification:
It turns out one of the only ways to currently measure the effect of unaligned IO on your NetApp is via the pw.over_limit counter. It's not a standard counter (e.g. it is not available via SNMP).
It's available only in "priv set advanced" command-line mode, so a little effort was needed to get this counter tracked over time in our Cacti trending system.

A cronjob kicks off a script every 5 minutes to ssh into the NetApp head and log the pw.over_limit counter to a file for Cacti to pick up and graph:

ssh netapp " priv set advanced ; wafl_susp -w " |& egrep pw.over | awk '{print $NF}'
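
For anyone who wants to reproduce this, a minimal sketch of the collection script is below; the hostname, log path, and file name are illustrative rather than the exact script we run.

#!/bin/bash
# log_pwol.sh - sketch of the 5-minute collection cron job (illustrative
# names/paths). Requires key-based ssh to the filer as a user allowed to
# enter "priv set advanced".
#
# Example crontab entry:
#   */5 * * * * /usr/local/bin/log_pwol.sh

FILER="netapp"
LOG="/var/www/html/netapp/${FILER}/pwol.log"

# wafl_susp -w dumps WAFL suspend statistics; keep only the last field of
# the pw.over_limit line (the raw, monotonically increasing counter).
VALUE=$(ssh "$FILER" "priv set advanced; wafl_susp -w" 2>&1 \
        | egrep 'pw.over' | awk '{print $NF}')

# Keep the newest sample on the first line of the log, since the Cacti
# input script reads it back with "head -1" (see the comments below).
{ echo "$VALUE"; head -n 999 "$LOG" 2>/dev/null; } > "${LOG}.tmp" \
    && mv "${LOG}.tmp" "$LOG"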

Cacti was then configured to graph the counter deltas every 5 minutes.
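
As a quick sanity check outside Cacti, the 5-minute deltas can also be eyeballed straight from the log; the one-liner below assumes the newest-first, one-value-per-line format used in the collection sketch above.

# print the change between consecutive samples (newest sample is on line 1)
tac pwol.log | awk 'NR > 1 { print $1 - prev } { prev = $1 }'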


We immediately noticed time-of-day patterns of spikes in this stat and were able to use this information to isolate sources of misaligned IO (e.g. daily Tripwire reports kicking off at 4am). We had a couple of latency spike outages in the 4-5am timeframe which, before we were tracking this stat, were a total mystery because the regular IOPS stats from Performance Advisor showed things relatively idle during this time.

Alignment status: 2 weeks ago we had 110 VMs unaligned. Today it's 34 and dropping daily, thanks to the increased priority that came with a fuller understanding of VM (mis)alignment, its effects, and how to quantify them over time.

edit 12/1/10:
Exported Cacti XML template for those interested

14 comments:

Sebastian Kayser said...

Fletcher, thanks for the writeup!

Did NetApp provide you with a ballpark figure on how many partial writes are too many (e.g. in relation to disks / raw IOPS potential), or did the pw.over_limit counter just give you something to correlate with the latency spikes?

Also, if there's more background info on the "how it was tipping ONTAP from normal mode to severely degraded" part, I would be glad to learn about it. What exactly is going on at the WAFL layer that induces such hefty latencies?

Sebastian

VCP #20255 said...

@Sebastian - thanks for the feedback - surprisingly there is little concrete, publicly available information on this, which is why I opened a Request For Enhancement (RFE) with NetApp:

Description of Problem: PA RFE: quantify impact of unaligned I/O. Hi, we've struggled with latency spike outages which we now realize were due at least in part to unaligned virtual machines (VMs).
This is a serious problem (see case 2001447643) with seemingly no customer-facing tools to quantify the effect before a huge latency spike causes VMs to crash and potentially lose data.
The purpose of this case is to request enhancement in Performance Advisor to help quantify:
1) level of unaligned IO
2) level of unaligned IO relative to thresholds which will cause ONTAP to degrade
3) level of unaligned IO as a percentage of total IO

Currently we are using the priv counter pw.over_limit.
In case 2001447643 the NetApp engineer did a custom computation for #3 above and used this derived statistic as a basis to conclude the unaligned IO was dangerously high.
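
The exact computation the engineer used isn't public, but conceptually it's just a ratio of two counter deltas taken over the same interval. A rough back-of-envelope sketch is below; everything other than pw.over_limit (the filer name, the interval, and especially the total-writes figure) is a placeholder you'd substitute with numbers you trust, e.g. from sysstat or a Performance Advisor export.

#!/bin/bash
# unaligned_pct.sh - back-of-envelope estimate of unaligned IO as a
# percentage of total write IO (a sketch, not NetApp's actual formula).

FILER="netapp"
INTERVAL=300    # seconds between the two samples

get_pwol() {
    # pw.over_limit is a monotonically increasing counter from wafl_susp -w
    ssh "$FILER" "priv set advanced; wafl_susp -w" 2>&1 \
        | egrep 'pw.over' | awk '{print $NF}'
}

PWOL_1=$(get_pwol)
sleep "$INTERVAL"
PWOL_2=$(get_pwol)
PWOL_DELTA=$((PWOL_2 - PWOL_1))

# Placeholder: total write ops seen by the filer over the same window,
# gathered however you prefer (sysstat, Performance Advisor export, etc).
TOTAL_WRITES=100000

awk -v p="$PWOL_DELTA" -v t="$TOTAL_WRITES" 'BEGIN {
    printf "%d partial writes over limit in %d writes (~%.1f%% of total)\n",
           p, t, (p / t) * 100 }'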

eric said...

Hi guys,

We are struggling with the exact same issue as Fletcher here, and it's amazing how few people know about this and how poorly VMware is implemented (in our case by VMware themselves!).

Might I add that this problem is not exclusive to NetApp.

Misaligned LUNs will cause issues on any storage. It's also avoided by correct implementation of VMware.

We also have a mix of SATA and FC disks on the controller, which does not help; add to this that we also saw SnapMirror struggle (due to long CP times). We worked around it by increasing SnapMirror updates to every 15 minutes (they used to be every 3 minutes).

And all of this because of a P2V tool that is free and does not align filesystems...

Anders said...

Hi, we're having the exact same problems with a lot of misaligned VMs causing performance loss. Just wondering, what tool do you use to align your VMs? Our systems are critical and can't be down for the amount of time that NetApp's mbrtool requires.

VCP #20255 said...

Hi Anders,

I saw this on www.vmdamentals.com, but I have not tested it:

"If you cannot live with the downtime, but need to align anyway, you could consider to look at Platespin products. They can perform a “hot” V2V and align in the process. When data moving is complete, they fail over from the original VM to the newly V2Ved VM, syncing the final changes on the destination disk(s). You end up with an aligned copy of your VM with minimal downtime."

Matt said...

Would you be able to provide a copy of the Cacti XML and the script file you produced for this? I've been struggling to work out how to put these together.

Thanks.

VCP #20255 said...

@Matt - I've added a link to the post to download the exported Cacti XML.
Hope it proves useful for you!

VCP #20255 said...

After we aligned all our VMs, the partial writes did not disappear entirely. We determined Oracle on NFS was responsible for the remaining unaligned IO.

Apparently newer versions of Oracle provide an option to align this IO, and ONTAP 7.3.5.1 provides a new nfsstat switch to track unaligned IO associated with specific files (e.g. Oracle redo logs) on NFS:
Please see:
http://communities.netapp.com/post!reply.jspa?message=49201

StockManiac2008 said...

Thanks for the tip on using Cacti to graph this. Would you mind sharing the getPWOL.sh script too? I'm new to Cacti and struggling to get it to work.

VCP #20255 said...

@StockManiac:
Sure - getPWOL.sh simply echoes the latest counter value:

#!/bin/bash
# Cacti data input script: prints "pwol:<latest value>" for the filer named in $1.
# The value is the first line of the log written every 5 minutes by the cron job.
echo -n "pwol:$(lynx --dump http://webserver.withlogfileaccess.edu/netapp/$1/pwol.log | head -1)"

So I have a cronjob running on a Linux host with ssh key access that sshes into the NetApp every 5 minutes and logs pw.over_limit to this file, and the Cacti script retrieves it via HTTP:
ssh netapp " priv set advanced ; wafl_susp -w " |& egrep pw.over | awk '{print $NF}'

Invisible said...

Hello Fletcher

Well, really useful info.

I'm running 8.0.2 and with nfsstat -d I can see the number of misaligned IOs, for example:

Misaligned Read request stats
BIN-0 BIN-1 BIN-2 BIN-3 BIN-4 BIN-5 BIN-6 BIN-7
139722 3224 1757 2112 1206 1104 74728 119818
Misaligned Write request stats
BIN-0 BIN-1 BIN-2 BIN-3 BIN-4 BIN-5 BIN-6 BIN-7
396395 1892 204 24925 176 47025 1401960 267476

Do you know how the async_read, pw.over and wp.partial write counters correlate to the BIN-X output above?

Second question: what might be the cause if misaligned IOs still persist in an ESX-only environment?

vExpert2011 said...

@Invisible - please see:

http://communities.netapp.com/thread/15458

"Plus, you can have aligned VMs, but still have applications that generated non-aligned writes.

the example below is for an oracle database on NFS, all writes are aligned (it's NFS), but the writes to the redo logs can be for any size between 512b to ..... In this case, the writes can fall through a block boundary."

John Martin said...

Partial writes on an Oracle redo log file (or any other file which is written to sequentially) are handled pretty well by the existing partial write mechanisms inside of ONTAP. For the most part these are held in memory until the subsequent writes to the log file come in and these are combined internally into a single 4K block. The real killer is partial overwrites which I think I covered off in a blog post here

http://storagewithoutborders.com/2010/07/21/data-storage-for-vdi-part-8-misalignment/
