Thursday, November 18, 2010

Clearing broken-off snapmirror relationships

We have migrated many vfilers and volumes recently, and some of these left behind old snapmirror relationships in the "Broken-off" state (as reported by snapmirror status).
The snapmirror.conf no longer referred to these volumes, so it's an internal state that needed to be cleared.
Turns out the method that works is snapmirror release:

snapmirror release {srcpath} {dstfiler}:{dstpath}
in which {srcpath} and {dstpath} are
{volname} or {/vol/volname/qtreename}
- remove destinations

I received an error when I issued the command:

No release-able destination found that matches those parameters. Use 'snapmirror destinations' to see a list of release-able destinations.

But a subsequent snapmirror status revealed the Broken-off record was cleared (gone)
(Don't always believe what ONTAP tells you ! :)

Friday, November 12, 2010

vFiler Migrate Netapp lockup

While we were using Netapp Provisioning Manager to migrate vFilers from an older 3040 to a new 3170, we ran into a bug causing the netapp head to drop off the network.
On the RLM console I observed many failed operations with the message:

error=No buffer space available

After consulting with Netapp support, the recommendation was to apply the solution for BUG 90314.

Specifically, this means setting these hidden options to de-prioritize volume-deletion-related operations (these ops had swamped the netapp during an aborted vfiler migration, which is another related issue):

options wafl.trunc.throttle.system.max 30
options wafl.trunc.throttle.vol.max 30
options wafl.trunc.throttle.min.size 1530
options wafl.trunc.throttle.hipri.enable off
options wafl.trunc.throttle.slice.enable off

So far, we have not seen the issue again.

Monday, November 8, 2010

vMotion High CPU vmmemctl

I was investigating a strange issue today during a vMotion of a CentOS 5 tomcat server (running 6 tomcat containers)

source host: Dell R900 Intel Xeon E7450 ESX 4.1 260247

dest host: Dell R610 Xeon 5680 ESXi 4.1 260247 (new)

I vMotion'ed the first of 2 Centos 5 VMs from the source to dest while running top and observed an unusual lag in top refresh during the cutover.

After about 15 seconds of no refreshes the VM came back and top showed a high load average with vmmemctl near the top. I also observed the "mem used" column on this 8Gb VM go from 7.9Gb to 2.4Gb in the span of 10-40 seconds!

It turned out (after examining esxtop->(m)emory output) that the VM was swapping, even though we were nowhere near overcommitted on these ESX hosts.

The SWCUR and SWTGT columns showed zeros for all other, well-behaved VMs, but these two app servers had recently had their memory allocation increased from 2Gb to 8Gb, and the Edit Settings ->Resources->Memory-> Unlimited checkbox had not retained its unlimited value.
Updating it to be unlimited and restarting the VM fixed the issue.
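
One way to catch this before the next vMotion is to compare each VM's configured memory against its memory limit. Below is a minimal sketch in Python over hand-entered records. The VM names and the field names (configured_mb, limit_mb) are hypothetical, None stands in for the Unlimited checkbox, and the 2Gb leftover limit is an assumption based on the old allocation. In practice the values would have to come from the VI client or an API query.

```python
def find_limited_vms(vms):
    """Return names of VMs whose memory limit is below their configured memory."""
    flagged = []
    for vm in vms:
        limit = vm["limit_mb"]
        # None represents "Unlimited": no limit is enforced.
        if limit is not None and limit < vm["configured_mb"]:
            flagged.append(vm["name"])
    return flagged


# Hypothetical inventory: two app servers bumped from 2 GB to 8 GB
# whose limit is assumed to have stayed at 2 GB, plus one healthy VM.
inventory = [
    {"name": "tomcat-01", "configured_mb": 8192, "limit_mb": 2048},
    {"name": "tomcat-02", "configured_mb": 8192, "limit_mb": 2048},
    {"name": "db-01", "configured_mb": 4096, "limit_mb": None},
]

print(find_limited_vms(inventory))
```

Any VM this flags is a candidate for the ballooning and swapping described above, even on hosts that are nowhere near overcommitted.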

Friday, October 8, 2010

VMware Fusion SSD performance

This week I took delivery of a much anticipated upgrade - a new Macbook Pro with SSD ( APPLE+SSD+TS256B )
The SSD macbook pro is performing as advertised – applications start up way faster

Ranks #8 on the SSD list of benchmarks here

Small random IO is way faster on SSD than HDD as this user noted:

In my tests I found IO intensive Operations were greatly sped up on SSD (y axis is time in seconds – lower is better):

Windirstat was over 7x faster
Windows XP restart was over 2x faster

Plus the battery life is extended with SSD and it runs quieter and cooler

Looking forward to being able to run more VMs in fusion - the increased IO speed opens up possibilities to even run virtual ESX/ESXi - testing migrations and upgrades of the enterprise VI on the laptop!

Thursday, September 30, 2010

Setup ONTAP Simulator in vSphere

I recently had cause to run some tests on different Netapp ONTAP versions (I am working through a vFiler failover bug with Netapp) and wanted to see if the bug is present in 8.x (we are looking to move from 7.x - but not just yet). I created 2 instances so I could test snapmirroring between them and vFiler failover.
Netapp does not yet provide an OVF version of the simulator (OVF recently attained ANSI standard), so it is currently designed to boot directly in VMplayer/Workstation or Fusion. With vSphere (ESX 4.1) I found the cleanest way is to run it through the converter:

Here are the steps to get the simulator up and licensed:
1) download the simulator ( to the windows VM you have the converter installed on
2) unzip the file and start the converter, click Convert Machine
3) select "VMware Workstation or other VMware virtual machine" and browse to the .vmx you just unzipped
4) give the rest of the info including destination, name, datastore etc
5) once converted, you can boot it up and press CTRL-C for the boot menu
6) select option 4 to clear the config and reboot into the setup
7) provide IP address, name etc to configure the simulator
8) optional: login and issue options httpd.admin.enable on to allow the web interface
9) licensing: get the license codes from

Have fun recreating bugs ;) !

Monday, September 20, 2010

ESXi vmotion fails 10%

This one could be filed under "Unexpected networking differences between ESX and ESXi"

I was testing out adding a new ESXi 4.1 host to an existing cluster of ESX 4.1 hosts - the operation would time out and fail at 10% progress and I received this error each time I attempted a test vMotion from the ESX hosts to the new ESXi host:
"The VM Failed to resume on the destination during early power on"
I ran through some troubleshooting steps but did not realize the source of the problem until I SSH'ed to the ESXi box and attempted to touch a file on the (Netapp NFS) datastore and received:

Permission denied (Read-only filesystem)

But wait - I had added the vmkernel IP address to the netapp NFS export ACL with full read-write permission like I have done for all ESX hosts in the past.
It turned out, unlike ESX, ESXi was using the mgmt port IP for NFS and vMotion (even though vMotion was disabled for this port):

MGMT (vMotion disabled):

VMkernel (vMotion enabled):

Once I added the mgmt port IP to the Netapp ACL and remounted the NFS datastore (truly read-write now), the vMotion succeeded. I'm left to determine why ESX uses the VMkernel port by default for NFS datastores and vMotion, while ESXi seems to default to the mgmt port.
At present the mgmt and VMkernel ports share the same networks, but this may not always be the case.
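
A check like the following would have flagged the problem before the first test vMotion. It is only a sketch: the addresses come from the documentation range rather than the real lab, the export list is typed in by hand instead of being read from the filer, and missing_from_export is a made-up helper name.

```python
import ipaddress


def missing_from_export(host_ips, rw_hosts):
    """Return the labels of host IPs not covered by any read-write export entry."""
    # Accept both single addresses and CIDR networks in the export list.
    allowed = [ipaddress.ip_network(entry, strict=False) for entry in rw_hosts]
    missing = []
    for label, ip in host_ips.items():
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in allowed):
            missing.append(label)
    return missing


# Hypothetical addresses, not the real lab values.
esxi_host = {"mgmt": "192.0.2.10", "vmkernel": "192.0.2.20"}
export_rw = ["192.0.2.20", "192.0.2.21"]  # only VMkernel IPs were exported

print(missing_from_export(esxi_host, export_rw))
```

Listing every interface the host might use for NFS, rather than only the one you expect it to use, also covers the day the mgmt and VMkernel ports end up on different networks.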
Comments welcome on this one!

Sunday, September 19, 2010

Netapp IOPS threshold alerting

This post could be considered part 2 of the Deconstructing IOPS post where we picked apart the total number of IO operations into their constituent sources to determine which source may be consuming more than expected. In the previous case it turned out we had stacked up snapmirror operations on top of each other (inadvertently over time by adding more and more volumes and snapmirror schedules with the same aggressive SLA).
We had squashed that one, rid ourselves of the VM killing latency spikes and returned our disk busy levels from critical levels (80%) down to normal (40%) as measured by Netapp's Performance advisor.
But about two weeks ago we started seeing new pages to sysadmins during our backup window (Fri evening - Saturday): webapps timing out and losing DB connectivity (needing to be restarted), virtual machines responding slowly, oracle reporting many long running queries (that usually run quickly). So initially we focused on the backups - obviously they must be killing our performance - but no, nothing was new there network or storage wise. Then I logged back in to Performance Advisor and was stunned to see the IOPS on our busy AGGR1 back up to critical levels (6000+) - was it the snapmirrors again? No, a quick check of the schedule and status showed they were still idle on their every-other-hour schedules.
Time to repeat the IOPS deconstruction exercise: What are the sources of these 6000 IOPS and what is this very regular 5 minute spike pattern in IOPS?

Looking at each volume's IO graph in PA, one quickly stood out as the most likely source with the same 5 minute spike pattern - ora64net which is our Oracle RAC storage.

From the long running queries list I was able to work with the DBA to determine the source was a new monitoring system we are testing (Zabbix) and specifically the orabbix database check.
I then worked with the sysadmin to determine which of the 55 orabbix database checks were causing the IO spikes. He disabled them in turn until it was discovered the audit log query was responsible (the query was operating on a million-row unindexed table that was constantly growing from all of our monitoring logins!). Truncating the table to something manageable was one solution; disabling the audit check was the immediate relief:

But how to not get surprised like this in the future?
Netapp operations manager custom thresholds and alarms!
Operations Manager not only tracks the metrics presented in Performance Advisor, it also provides a facility to configure custom thresholds for use with Alarms - but you need to drop to the dfm command line to configure these:
C:\ dfm perf threshold create -o 513 -d 300 -e aggr1_iops_4k -C "aggregate:total_transfers 4000 per_sec upper"
C:\ dfm perf threshold create -o 513 -d 300 -e aggr1_iops_4k -C "aggregate:total_transfers 4000 per_sec lower"

I found the object ID (513) for this aggr1 via the dfm aggr list command and the aggregate:total_transfers label by hovering over the column heading in the OM web interface and looking at the URL (did I mention I had a hard time finding Netapp docs on DFM/OM threshold creation?)
Now when you go to the OM web interface Setup->Alarms you will see new Event Names to trigger an alarm off (perf:aggr1_iops_4k:breached, perf:aggr1_iops_4k:normal).
I configured OM to send me an email if the IOPS breach the 4000 level (this is well in advance of the 6000 IOPS level which for this aggregate will result in all the webapp, Oracle, VM timeout issues).
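
To make the breached/normal behavior concrete, here is a rough Python simulation of an upper threshold with a hold time. It assumes that -d 300 means the value must stay above the limit for 300 seconds before the breached event fires, which is worth confirming against the dfm help output. The sample values are invented, and this is not how Operations Manager is implemented internally.

```python
def threshold_events(samples, limit=4000, duration=300):
    """Return (timestamp, event) pairs for an upper threshold with a hold time.

    samples: list of (timestamp_seconds, value) in time order.
    """
    events = []
    over_since = None   # time the value first went above the limit
    breached = False
    for ts, value in samples:
        if value > limit:
            if over_since is None:
                over_since = ts
            # Fire only once the value has stayed high for the full duration.
            if not breached and ts - over_since >= duration:
                breached = True
                events.append((ts, "breached"))
        else:
            over_since = None
            if breached:
                breached = False
                events.append((ts, "normal"))
    return events


# Hypothetical one-minute samples of aggregate total_transfers.
samples = [
    (0, 3500), (60, 4200), (120, 3900),       # short blip, ignored
    (180, 4500), (240, 5200), (300, 6100),
    (360, 6000), (420, 5800), (480, 5900),    # sustained for 300 s
    (540, 3000),                              # back below the limit
]
print(threshold_events(samples))
```

The hold time is what keeps a single short spike from paging anyone, while a sustained climb toward the 6000 IOPS danger zone still raises the alarm early.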
Now I expect to never be surprised by this again ;)

Friday, September 10, 2010

Vib signature missing

Testing out ESXi 4.1 on a set of new Dell 610's in the lab I had no problem installing the Dell Openmanage VIB for remote hardware monitoring. But when attempting to install the Myricom driver for the 10GigE PCIe card I ran into this error:

Please wait patch installation is in progress ...
No matching bulletin or VIB was found in the metadata.No bulletins for this platform could be found. Nothing to do.

This turned out to mean the Myricom driver was built for ESXi 4.0.* (as specified in the zip bundle's vmware.xml file), which does not match ESXi 4.1, so the driver installation was aborted.
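
The mismatch is easy to picture as a wildcard comparison. The snippet below is only an illustration using Python's fnmatch. It is not esxupdate's actual matching logic, and bundle_applies is a made-up helper name.

```python
from fnmatch import fnmatch


def bundle_applies(platform_pattern, host_version):
    """True if the host version matches the bundle's platform version pattern."""
    return fnmatch(host_version, platform_pattern)


print(bundle_applies("4.0.*", "4.0.0"))  # True
print(bundle_applies("4.0.*", "4.1.0"))  # False
```

A bundle declared for 4.0.* simply never matches a 4.1 host, which is why the installer reports that there is nothing to do rather than a more obvious version error.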

I contacted Myricom for an updated driver and they did not have one ready yet but would I like to try a beta driver? Sure, this is the lab environment anyway, so I tried the new beta driver and received this error:

/tmp # esxupdate update --bundle=/tmp/
Encountered error VibSigMissingError:
The error data is:
Vib - /var/tmp/cache/cross_vmware-esx-drivers-net-
Errno - 20
Description - Vib signature is missing.

This turned out to be an expected error: because the driver is beta, it's not signed.
(I had also tried issuing the update remotely, which failed with the same error.)
The solution in this case is to add the --nosigcheck flag to the command to skip the signature check:

esxupdate update --bundle=/tmp/ --nosigcheck

Up until now we have been all ESX, so besides breaking in the new hardware, this ESXi lab work is serving to help define the new administrative workflows, expose issues and develop solutions.
Since vmware announced ESX is done with this 4.1 version, we are investing the lab time now and do not plan to deploy any new ESX 4.1 hosts. Instead we will deploy ESXi 4.1.

Saturday, July 24, 2010

Error Upgrading VMware Tools

Recently we upgraded to vSphere 4.1 and this as expected included a new version of VMware Tools.
Many of the tools upgrades we initiated from vCenter were immediately reported as failed with the very non-descriptive:

Error Upgrading VMware Tools.

Turns out, if you look at the corresponding vmware.log for the failed tools upgrade VM, you will see errors to the effect: "TOOLS INSTALL Error copying upgrader binary into guest"
For the windows failure, fix this by deleting the binary left over from the previous upgrade:

del C:\WINDOWS\Temp\VMwareToolsUpgrader.exe

For Linux failures, create the destination directory:

mkdir /tmp/vmware-root

and retry the VMware Tools upgrade from vCenter - it will proceed without error.

Hopefully VMware fixes this with a patch of some sort soon.

Friday, July 16, 2010

Quantifying VMDK Misalignment

I admit it, I'm a skeptic - but when there are many items competing for a limited number of waking hours, I've found a healthy measure of skepticism serves one's sanity well!
So when a vendor recommends a change or action I ask a few questions:
1) why - what are the technical reasons behind this recommendation?
2) how can I measure the current state and the effect of your recommendation?

Such was the case with Netapp's long standing recommendation to align VMs.
Until very recently the answers were
1) it's less than optimal now with unaligned VMs
2) you can't measure it with customer tools - just trust us, unaligned VMs create inefficiencies and they could be causing your outages!

These vague and less than transparent answers combined with the fact Netapp's VM alignment tools require the VMs to be off for hours (depending on the size of the VM) meant the alignment recommendation was de-prioritized until we could
a) get clear answers
b) measure the before and after states

Last month we finally received both a technically sufficient explanation and a method to measure unaligned IO (via an advanced privileged stat) to re-prioritize the alignment project and organize VM downtime with all customers to get it done.

On the why: after several outages in 2 weeks we escalated to Netapp support and finally had a Netapp engineer assigned who could explain from our perfstat captures the effect of the misaligned IO - how it was tipping ONTAP from normal mode to severely degraded (latency-spike-inducing, VM-killing) mode (and I'm paraphrasing here ;)

On the quantification:
It turns out one of the only ways to currently measure the effect of unaligned IO on your netapp is via the pw.over_limit counter. It's not a standard counter (e.g. it is not available via SNMP).
It's available only in "priv set advanced" command line mode - so a little effort was needed to get this counter tracked over time in our cacti trending system.

A cronjob kicks off the script to ssh into the Netapp head every 5 minutes and log the counter to a file for cacti to pick up and graph.
pw.over_limit counter

ssh netapp " priv set advanced ; wafl_susp -w " |& egrep pw.over | awk '{print $NF}'

Cacti was then configured to graph the counter deltas every 5 minutes:
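
The delta step itself is simple and can be sketched in Python as follows. Cacti does this for you when the data source is treated as a counter, so this is only to show the idea. The readings are invented, and treating a drop in the value as a counter reset is an assumption about how the counter behaves across reboots.

```python
def counter_deltas(samples):
    """Turn cumulative counter readings into per-interval deltas.

    samples: list of (timestamp, counter_value) in time order.
    A reading lower than the previous one is treated as a counter reset
    and yields the new reading itself as the delta.
    """
    deltas = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        delta = cur - prev if cur >= prev else cur
        deltas.append((ts, delta))
    return deltas


# Hypothetical readings taken every 300 seconds.
readings = [(0, 1000), (300, 1040), (600, 1900), (900, 1910), (1200, 25)]
print(counter_deltas(readings))
```

Graphing the deltas rather than the raw counter is what makes the spikes stand out: the raw value only ever climbs, while the delta shows how much each 5 minute interval contributed.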

We immediately noticed time of day patterns of spikes in this stat and were able to use this information to isolate the sources of misaligned IO (eg daily tripwire reports kicking off at 4am). We had had a couple of latency spike outages in the 4-5am timeframe which, before we were tracking this stat, were a total mystery because the regular IOPs stats from Performance Advisor showed things relatively idle during this time.

Alignment status: 2 weeks ago we had 110 VMs unaligned. Today it's 34 and dropping daily thanks to the increased priority due to a fuller understanding of VM (mis)alignment and its effects and how to quantify those over time.

edit 12/1/10:
Exported CactiXML template for those interested

Thursday, July 15, 2010

vSphere 4.1 upgrade gotchas: ssh, vCenter changes

With the release of vSphere 4.1 this week, we upgraded the lab cluster to check out the new features - especially the vStorage API stats like latency & IOPS.
After instantiating a new vCenter VM (64bit windows 2008 - because 4.1 requires 64bit now), I used the new 4.1 vCenter (loaded with the 4.1 upgrade zip file) to upgrade the hosts. Each of the Dell 1950 nodes in the lab cluster completed the upgrade and reboot in under 15 minutes.
But what we found was we could no longer ssh into the 4.1 nodes with our user accounts.
In /var/log/messages we saw (because we could get in as root on the console):
error: PAM: Permission denied for useracct from sourceIP

The /etc/passwd accounts were intact, and we could su - useracct - so what changed?

Turns out the answer is at the bottom of page 65 of the vsp_41_upgrade_guide.pdf. In practice, the user accounts now need to be listed as root group members to allow ssh for them:

NOTE After upgrading to ESX 4.1, only the Administrator user has access to the service console. To grant service console access to other users after the upgrade, consider granting the Administrator permissions to other users.

So editing the /etc/group file and adding all the users we had in the wheel group (for sudo access) to the root group fixed the issue immediately.
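
With more than a couple of hosts, the group edit is worth scripting. Here is a minimal Python sketch of the line manipulation only. It works on strings rather than touching /etc/group, and the account names are placeholders.

```python
def members_of(group_line):
    """Return the member list of a group file line (name:passwd:gid:members)."""
    return [m for m in group_line.strip().split(":")[3].split(",") if m]


def add_members(group_line, new_members):
    """Return the group line with new_members appended, skipping duplicates."""
    name, password, gid, members = group_line.strip().split(":")
    current = [m for m in members.split(",") if m]
    for member in new_members:
        if member not in current:
            current.append(member)
    return ":".join([name, password, gid, ",".join(current)])


# Hypothetical group entries; the account names are placeholders.
wheel = "wheel:x:10:root,alice,bob"
root = "root:x:0:root"

print(add_members(root, members_of(wheel)))
```

Running it a second time on its own output changes nothing, so it is safe to apply to hosts that were already fixed by hand.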

If VMware's intent was to get the attention of ESX users "Hey 4.1 is the last ESX version - get migrating to ESXi!" - mission accomplished :)

Other than those minor hiccups (64 bit required for vCenter, ssh breaking) we are impressed with all the new features and performance improvements VMware has packed into a "minor" (4.0->4.1) release.

Tuesday, July 6, 2010

VMware and Netapp Storage: deconstructing IOPS

Over 6 months ago in Dec 2009, we started experiencing unexplained latency spikes in our Netapp central storage. These would cause the VMware virtual machines (VMs) to crash - typically the linux VMs would revert to a read-only mode and the windows VMs would not recover without a reboot. The average latency was in the 10 millisecond range, but would spike over 1000 milliseconds (1 second) at seemingly random times off peak from our normal VM workloads. Netapp's Operations Manager (previously known as Data Fabric Manager (DFM)) logs the statistics and Performance Advisor (PA) is used to review/query the data.

We would open Netapp support cases and send in perfstats for analysis, but because we could not predict the timing or reproduce the spikes on demand we never had good data for root cause analysis. We were told to align our VMs to improve Netapp performance, but without any way to quantify the effect of misalignment, the project to align 100's of VMs was de-prioritized - especially since the Netapp alignment tool required the VM to be down for a number of minutes relative to the size of the VM's vmdk disk files.
In late May of 2010, the spikes started happening much more frequently (several per week) and we opened a new case, uploading perfstats which now included the spike data. According to Netapp, there were several snapmirror operations in progress at the time of the latest perfstat spike capture. This did not seem unusual since we had 6 volumes on this aggregate of 67 x 10K RPM disks scheduled to snapmirror our VM data to another Netapp 3040 cluster 3-4 times per hour.
But it started me thinking - so how much of the disk IOPS are related to snapmirror operations? Performance Advisor was good at showing me the total IOPS for the aggregate, and IOPS for the volumes, but I wanted to map the IOPS to the function (snapmirror, dedup, NFS (VMware), etc)
I signed off the Netapp conference call that morning announcing I was going to deconstruct the total IOPS.
First a little background on Disk types and their inherent physical latency profiles.

Disks and the disk aggregates they comprise will exhibit a normal increase in average latency as the IOPS level (workload) increases. According to Netapp, at a certain point the IOPS and disk_busy percentage (a Netapp Performance Advisor metric) will become too high and a latency spike will result - this is expected behavior according to Netapp. For our 10K RPM aggregate this level turned out to be in the range of 120-150 IOPS per data drive. Performance Advisor was showing a strange flat consistent level of IOPS and disk_busy %.
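
The per-drive figure translates into an aggregate-level ceiling with simple arithmetic. The Python sketch below shows the calculation. The data drive count of 50 is a made-up input for illustration, since only the total disk count (67) is given here and not the number of data drives, and both function names are hypothetical.

```python
def iops_ceiling(data_drives, per_drive_range=(120, 150)):
    """Return the (low, high) IOPS range where latency spikes become likely."""
    low, high = per_drive_range
    return data_drives * low, data_drives * high


def utilization(current_iops, data_drives, per_drive_range=(120, 150)):
    """Fraction of the lower ceiling currently in use."""
    low, _ = iops_ceiling(data_drives, per_drive_range)
    return current_iops / low


# 50 data drives is a hypothetical figure, not a measured one.
print(iops_ceiling(50))
print(round(utilization(4000, 50), 2))
```

Substituting the real data drive count for an aggregate gives a number to compare against the total IOPS graph in Performance Advisor.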

Since Netapp's analysis of the perfstat showed snapmirrors were busy at the time of the last spike, I decided to propose a temporary disabling of the snapmirrors for all 6 volumes on the busy aggregate. We did so, and the results were apparent from the PA disk_busy % view:

Of the 6800 IOPS on the aggregate, 3400 disappeared with snapmirror disabled (~50% of the IOPS were related to snapmirror operations!)
Only about 1000 IOPS/sec were related to the VMware NFS operations (< 15% of the 6800 total). After another conference call with Netapp to discuss the results, we decided we needed to reschedule our VM snapmirrors from 3-4/hour to once every 2 hours.
With this snapmirror modification we saw our disk busy % drop from 80% to 40%.
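
The breakdown is plain arithmetic, sketched here in Python with the figures from this post. The remainder is just whatever is left after subtracting the two measured sources. It is not a separately measured number.

```python
def share(part, total):
    """Percentage of the total that a single source accounts for."""
    return 100.0 * part / total


total_iops = 6800        # aggregate total before the change
snapmirror_iops = 3400   # dropped away when snapmirror was disabled
nfs_iops = 1000          # approximate VMware NFS load

# Derived, not measured: everything not attributed to the two sources above.
remainder_iops = total_iops - snapmirror_iops - nfs_iops

for name, value in [("snapmirror", snapmirror_iops),
                    ("VMware NFS", nfs_iops),
                    ("remainder", remainder_iops)]:
    print(f"{name}: {value} IOPS ({share(value, total_iops):.1f}%)")
```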
We could also document the IOPS load from dedup operations (they default to midnight) and aggr scrubs (default Sunday morning 2am-7am).
Now that we realize the latency degradation profile is not slow and gradual, but a drastic collapse when the disks' physical limits are reached, we are much more wary of adding any additional IOPS load (not just the traditional external NFS load, but also all the internal Netapp features like snapmirror that can actually dwarf the work your clients are doing - as was the case with us).
Or at least we can quantify which aggregates are approaching their critical IOPS thresholds and migrate load to other less busy aggregates.

I look forward to the additional vStorage API features expected in vSphere 4.x - including the datastore latency tracking and ability to configure QoS type priorities for sets of VMs to get IO resources before other sets.

Tuesday, June 8, 2010

VCP 410 Exam - passed!

Having worked exclusively with vSphere for over a year now, it was time to update my VCP3 to VCP4.
Today I took the VCP 410 exam and passed with a score of 431/500.
As I understand 300 is pass, 350 is required to be eligible as a VCI (instructor).
As I took the exam I marked 20 questions I was not 100% sure of for review (out of 85 questions total). I finished in about 1 hour (they give 90 minutes).
Passing this exam is required for VCP4 certification, along with the What's new in vSphere course I took over a year ago - I was told to expect a package from VMware in a few weeks.
Feels good all the studying paid off.
Having the VCP4 will allow me to pursue the new advanced certifications announced last month by vmware.

Friday, April 30, 2010

iPad as mobile thin client

I had been using Wyse's PocketCloud app for the iPhone for a while, but found the screen real estate on the iPhone too limiting for everyday use - but with the iPad that limitation is gone.
Above are screenshots of PocketCloud connected via VMware View config to a Windows 7 linked clone running on a 2 node View 4 vSphere cluster.
I am using a VPN connection over wifi and administering the cluster via the VI Client!

So far I have not encountered any operations that I can not perform on this platform - PocketCloud provides a robust UI (including popup draggable mouse pointer for fine tuned clicking and right clicking, etc)

With the addition of multitasking rumored in the upcoming v4 of the OS, we can hopefully manage multiple connections and be allowed to leave the session and resume without having to reconnect as is now necessary in v3.x.

Monday, April 12, 2010

Xen + ZFS = The Free VI Experiment

Having wrapped up last year's Physical to Virtual project, I decided to take stock of the progress made in the free/opensource virtualization and storage areas - how would the featuresets of commercial hypervisors and storage stack up against their lowcost/free counterparts?

We are a VMWare + NetApp shop for the most part in our production VI - we like to be able to call support when we have an issue in production

Thanks to the dozens of servers we virtualized, we have plenty of hardware to use for this in the lab.
For this experiment I decided to start with Sun's ZFS. Recently the deduplication feature was added to ZFS. I downloaded and installed build 129 of Solaris and created a 1 Tb zfs pool, then created filesystems and shared them out via NFS for the Xenservers to use.
I turned on dedup and copied 386Gb of VMs from Netapp volume to the ZFS filesystem.
Here is how the dedup savings stacked up:
fcocquyt@lab-zfs-01:~ 2:05pm 1 > zpool list
data1 928G 103G 825G 11% 4.01x ONLINE -
fcocquyt@lab-zfs-01:~ 2:06pm 2 > zfs list
data1/vms 386G 806G 386G /data1/vms
ZFS saved 386-103= 283Gb (283/386 = 73%)

Netapp (ONTap
netapp-01> df -sh
Filesystem used saved %saved
/vol/vm2/ 167GB 229GB 58%

Ok, so ZFS saves 73-58=15 percentage points more space through dedup - nice for a free solution!
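
The two savings figures can be reproduced from the command output above. This is a small Python sketch. It assumes the Netapp %saved column is saved / (used + saved), which matches the 58% shown, and pct_saved is a made-up helper name.

```python
def pct_saved(logical_gb, physical_gb):
    """Percentage of space saved when logical data fits in physical_gb."""
    return 100.0 * (logical_gb - physical_gb) / logical_gb


# ZFS: 386G of VMs occupy 103G in the pool.
zfs_saved = pct_saved(386, 103)

# Netapp: 167GB used plus 229GB saved gives the logical size.
netapp_saved = pct_saved(167 + 229, 167)

print(f"ZFS saved {zfs_saved:.0f}%, Netapp saved {netapp_saved:.0f}%")
print(f"difference: {zfs_saved - netapp_saved:.1f} percentage points")
```

Keep in mind the two figures come from different data sets (386G copied to ZFS versus the contents of /vol/vm2/), so the comparison is indicative rather than exact.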

Then I loaded Xenserver 5.5 on 3 old Sunfire x2100's (AMD CPU), each with 2Gb ram.
I installed Xencenter, created a pool, and used the update tool in Xencenter to update them all to 5.5 U1 and configure the NFS datastore shared from the ZFS system.
I uploaded a CentOS 5.3 ISO and a Windows 7 ISO and used Xencenter to create a new VM of each flavor.
VM Migration: once I installed xentools I was able to live migrate the new VMs around the pool's nodes (VMWare calls this vMotion).

What's missing:

In terms of the core featuresets there really is not much missing:
Netapp vs ZFS: both have snapshots, dedup, remote replication
VMWare vs Xenserver: both have snapshots, vMotion, P2V (Have not tried Xen's yet), centralized management

Conclusion: I was very impressed that in less than a day I could setup a free virtual infrastructure in the lab as a proof of concept to compare featuresets with the commercial VI solutions.

Pros and Cons of Blade Solutions for Virtual Infrastructure

In comparing blade vs traditional server solutions for virtual infrastructure (VI), there are pros and cons for both. Itemizing the advantages and disadvantages of each solution will help an IT department justify and defend the decision to go one direction or the other. Of course blades may be better suited for one customer and servers a better solution for another; there are many factors to consider in making the ultimate decision. Below are a few of the more common aspects to consider when trying to decide on blades or servers.


Blade Server solutions:
In a blade chassis a set of blade servers will conventionally share power and cooling
More recently there is a new trend to also share networking and storage I/O through converged network adapters (eg Cisco's Unified Computing System (UCS))

Server solutions:
In contrast to the blade solution, each server is autonomous with respect to its components - power, cooling etc - nothing is shared.

Pros and Cons:

Blade Solution:
Pros:
  • Efficiency: via shared power and cooling, blade servers offer better efficiency in these areas
  • Density: blade servers offer higher CPU density per rack U (although this can be a con if your datacenter can not handle the power and cooling density)

Cons:
  • Cost: requires the additional expense of a chassis to house the blades
  • Lock-in: the chassis represents an added level of vendor lock-in due to the chassis investment (which may cost as much as $30,000, or the price of several individual servers)
  • More lock-in: related to lock-in are the reduced negotiating power on pricing and the loss of business agility to go with best of breed as easily as when deploying standalone servers
  • All eggs in one basket: if the blade chassis has an issue or needs maintenance, all VMs hosted on its blades will be down at once

Standalone Server Solution:
Pros:
  • Cost: no chassis to pay for - can take advantage of the latest competitive pricing
  • Business Agility: allows choice of best of breed technology (without the lock-in of the blade chassis)

Cons:
  • Efficiency: takes more power and cooling per rack U than the shared blade chassis
  • Density: offers less CPU density per rack U