Wednesday, August 17, 2011

Quantifying Spindle:VM throughput relationship

Last week we took delivery of an additional DS4243 disk shelf (24 x 300 GB 15K RPM disks).
This morning we connected it non-disruptively to our NetApp 3270.
These 24 disks were slated to be added to our existing aggregate, which consisted of 2 x DS4243 shelves, so the new shelf would make up 1/3 of the spindles (IOPS and storage) of the newly expanded aggregate.
Before adding the disks to expand the aggregate, we wanted to take a benchmark from a VM's perspective BEFORE and then compare it with the same VM's performance AFTER the spindle upgrade.


NetApp 3270 running ONTAP
10Gb connection to ESXi 4.1 host
HD Tune is used for disk IO benchmark

BEFORE (46 disk Aggregate):

Read transfer rate
Transfer Rate Minimum : 21.5 MB/s
Transfer Rate Maximum : 95.9 MB/s
Transfer Rate Average : 65.8 MB/s
Access Time : 10.3 ms

AFTER (67 Disk Aggregate):

Read transfer rate
Transfer Rate Minimum : 0.6 MB/s
Transfer Rate Maximum : 96.7 MB/s
Transfer Rate Average : 82.9 MB/s
Access Time : 6.67 ms


Throughput: average transfer rate went from 65.8 to 82.9 MB/s (about 26% better)
Latency: access time went from 10.3 to 6.67 ms (35% improvement)
The deviation from the average is also much improved: the 2nd throughput graph shows the transfer rate staying in a much tighter 80-90 MB/s range than on the 1st, smaller aggr. This translates into a more deterministic performance profile for our VI (the larger aggr can “soak up” the short IOP demand spikes that would otherwise have slowed down the smaller aggr).
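
For anyone who wants to check the percentages, here is a minimal Python sketch that recomputes them from the measured figures in the BEFORE and AFTER results. The helper name pct_change and the dictionary keys are made up for illustration; the numbers are copied straight from the HD Tune output in this post.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the BEFORE/AFTER HD Tune results in this post.
before = {"disks": 46, "avg_mb_s": 65.8, "access_ms": 10.3}
after = {"disks": 67, "avg_mb_s": 82.9, "access_ms": 6.67}

# Percent change for each metric (negative means the value went down).
changes = {key: pct_change(before[key], after[key]) for key in before}

for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
# disks: +45.7%
# avg_mb_s: +26.0%
# access_ms: -35.2%
```

On these numbers, roughly 46% more disks in the aggregate came with roughly 26% more average read throughput and a 35% lower access time, for this one VM and this one pair of runs.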

Note: it was necessary to clone the "Before" VM to force WAFL to stripe out the VM data onto the newly added disks (Will WAFL do this automatically over time for existing data? Or will only NEW VMs realize the full 67 spindle benefits?)

Monday, August 15, 2011

vCenter SQL Server Scheduled Job Errors

Since migrating to SQL 2008, our app event log was showing errors of the type:

SQL Server Scheduled Job 'Past Day stats rollupVIM_UMDB' (0x2838E6A98D1EBE4CB211E1768836BA68) - Status: Failed - Invoked on: 2011-08-14 17:00:00 - Message: The job failed. Unable to determine if the owner (DOM\dom.acct) of job Past Day stats rollupVIM_UMDB has server access (reason: Could not obtain information about Windows NT group/user 'DOM\dom.acct', error code 0x5. [SQLSTATE 42000] (Error 15404)).

All other vCenter operations seemed to be fine, but these jobs were failing and logging an error every 5-10 minutes.
It turns out the solution is documented in KB15404.

The fix is to change the owner of these jobs from the domain account to sa.
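
If there are several jobs to fix, the ownership change can also be scripted rather than clicked through. The sketch below is only an illustration: it builds the T-SQL call to the msdb stored procedure sp_update_job, which accepts @job_name and @owner_login_name. The helper name change_job_owner_sql is hypothetical, and the list contains only the one job named in the error above; add the other rollup job names from your own SQL Server Agent. Review the generated statements before running them against the vCenter database server.

```python
def change_job_owner_sql(job_name: str, new_owner: str = "sa") -> str:
    """Return a T-SQL statement that reassigns a SQL Server Agent job's owner."""

    def quote(value: str) -> str:
        # Double any embedded single quotes so the literal stays valid T-SQL.
        return "N'" + value.replace("'", "''") + "'"

    return (
        "EXEC msdb.dbo.sp_update_job "
        f"@job_name = {quote(job_name)}, "
        f"@owner_login_name = {quote(new_owner)};"
    )


# Only the job named in the error message above; extend with your own job names.
failing_jobs = ["Past Day stats rollupVIM_UMDB"]

statements = [change_job_owner_sql(name) for name in failing_jobs]
for statement in statements:
    print(statement)
```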

Friday, August 5, 2011

False management network redundancy alert

We upgraded our DR cluster hosts recently and ran into an issue where vCenter was reporting:

HostXYZ "currently has no management network redundancy"

But wait - that's not true, the vSwitch clearly has two active connections!
I tried reconfiguring the ordering of the NICs in the teaming config, but the warning persisted.

Then I found KB1004700, which starts by talking about ignoring/suppressing the warning (ignoring alerts is very bad practice, for what should be obvious reasons).
But I kept reading and saw:

"Note: If the warning continues to appear, disable and re-enable VMware High Availability in the cluster."

So I disabled and enabled HA and this cleared the alert - no suppression of alerts needed!