Thursday, September 30, 2010

Setup ONTAP Simulator in vSphere

I recently had cause to run some tests on different Netapp ONTAP versions (I am working through a vFiler failover bug with Netapp) and wanted to see whether the bug is present in 8.x (we are looking to move from 7.x, but not just yet). I created two instances so I could test snapmirroring between them as well as vFiler failover.
Netapp does not yet provide an OVF version of the simulator (OVF recently attained ANSI standard status), so the current build is designed to boot directly in VMware Player/Workstation or Fusion. With vSphere (ESX 4.1), I found the cleanest way is to run it through the Converter:


Here are the steps to get the simulator up and licensed:
1) Download the simulator (http://now.netapp.com/NOW/cgi-bin/simulator) to the Windows VM where you have the Converter installed
2) Unzip the file, start the Converter, and click Convert Machine
3) Select "VMware Workstation or other VMware virtual machine" and browse to the .vmx you just unzipped
4) Provide the rest of the info, including destination, name, datastore, etc.
5) Once converted, boot it up and press CTRL-C for the boot menu
6) Select option 4 to clear the config and reboot into setup
7) Provide the IP address, name, etc. to configure the simulator
8) Optional: log in and issue "options httpd.admin.enable on" to enable the web interface
9) Licensing: get the license codes from http://now.netapp.com/NOW/download/tools/simulator/ontap/8.0/vsim_licenses_8.0.txt
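Steps 7-9 can be finished from the simulator console once it boots. A minimal sketch (the license codes below are placeholders, not real keys; substitute the codes from the vsim license file):

```shell
# At the ONTAP simulator console, after initial setup completes:

# optional: enable the web admin interface (step 8)
options httpd.admin.enable on

# install feature licenses (step 9) - placeholder codes shown,
# use the real codes from vsim_licenses_8.0.txt
license add ABCDEFG
license add HIJKLMN

# list installed licenses to verify
license
```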

Have fun recreating bugs ;) !

Monday, September 20, 2010

ESXi vMotion fails at 10%

This one could be filed under "Unexpected networking differences between ESX and ESXi"

I was testing out adding a new ESXi 4.1 host to an existing cluster of ESX 4.1 hosts. The operation would time out and fail at 10% progress, and I received this error each time I attempted a test vMotion from the ESX hosts to the new ESXi host:
"The VM Failed to resume on the destination during early power on"
I ran through some troubleshooting steps but did not realize the source of the problem until I SSH'ed to the ESXi box, attempted to touch a file on the (Netapp NFS) datastore, and received:

Permission denied (Read-only filesystem)

But wait: I had added the VMkernel IP address to the Netapp NFS export ACL with full read-write permission, as I have done for all ESX hosts in the past.
It turned out that, unlike ESX, ESXi was using the mgmt port IP for NFS and vMotion (even though vMotion was disabled on that port):

MGMT (vMotion disabled):


VMkernel (vMotion enabled):


Once I added the mgmt port IP to the Netapp ACL and remounted the NFS datastore (truly read-write now), the vMotion succeeded. I'm left to determine why ESX defaults to the VMkernel port for NFS datastores and vMotion, while ESXi seems to default to the mgmt port.
At present the mgmt and VMkernel ports share the same networks, but this may not always be the case.
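To pin down which interface ESXi sources NFS traffic from, and to widen the export, something like the following works (the IP addresses and volume path are hypothetical; exportfs syntax per Data ONTAP 7-mode):

```shell
# On the ESXi host (Tech Support Mode / SSH):
esxcfg-vmknic -l     # list all vmkernel NICs and their IPs
esxcfg-route -l      # show the vmkernel routing table
vmkping 10.0.0.50    # confirm the filer is reachable via vmkernel

# On the Netapp filer: add the mgmt port IP (10.0.1.21 here) to the
# export ACL persistently (-p), alongside the existing VMkernel IP:
exportfs -p rw=10.0.0.21:10.0.1.21,root=10.0.0.21:10.0.1.21 /vol/esx_ds1
exportfs             # list current export rules to verify
```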
Comments welcome on this one!

Sunday, September 19, 2010

Netapp IOPS threshold alerting

This post could be considered part 2 of the Deconstructing IOPS post, where we picked apart the total number of IO operations into their constituent sources to determine which source was consuming more than expected. In that case it turned out we had stacked snapmirror operations on top of each other (inadvertently, over time, by adding more and more volumes and snapmirror schedules with the same aggressive SLA).
We had squashed that one, rid ourselves of the VM-killing latency spikes, and returned our disk busy levels from critical (80%) to normal (40%) as measured by Netapp's Performance Advisor.
But about two weeks ago we started seeing new pages to sysadmins during our backup window (Friday evening through Saturday): webapps timing out and losing DB connectivity (needing to be restarted), virtual machines responding slowly, and Oracle reporting many long-running queries that usually run quickly. So initially we focused on the backups, since obviously they must be killing our performance, but no, nothing was new there network- and storage-wise. Then I logged back in to Performance Advisor and was stunned to see the IOPS on our busy AGGR1 back up to critical levels (6000+). Was it the snapmirrors again? No, a quick check of the schedule and status showed they were still on their idle every-other-hour schedules.
Time to repeat the IOPS deconstruction exercise: what are the sources of these 6000 IOPS, and what is this very regular 5-minute spike pattern in IOPS?


Looking at each volume's IO graph in PA, one quickly stood out as the most likely source, with the same 5-minute spike pattern: ora64net, which is our Oracle RAC storage.



From the long-running queries list I was able to work with the DBA to determine the source: a new monitoring system we are testing (Zabbix), and specifically the orabbix database check.
I then worked with the sysadmin to determine which of the 55 orabbix database checks was causing the IO spikes. He disabled them in turn until we discovered the audit log query was responsible (the query was operating on a million-row unindexed table that was constantly growing from all of our monitoring logins!). Truncating the table to something manageable was one solution; disabling the audit check was the immediate relief:
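For the audit-table cleanup, assuming the standard Oracle database audit trail (SYS.AUD$) and sysdba access on the DB host, the immediate relief can be sketched as:

```shell
# Run on the DB host as a DBA; SYS.AUD$ is the standard DB audit trail table
sqlplus / as sysdba <<'EOF'
-- see how large the audit trail has grown
SELECT COUNT(*) FROM sys.aud$;
-- empty the unindexed table the check keeps scanning
-- (archive the rows first if the audit records are needed)
TRUNCATE TABLE sys.aud$;
EOF
```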



But how to not get surprised like this in the future?
Netapp operations manager custom thresholds and alarms!
Operations Manager not only tracks the metrics presented in Performance Advisor, it also provides a facility to configure custom thresholds for use with Alarms. However, you need to drop to the DFM command line to configure these:
C:\ dfm perf threshold create -o 513 -d 300 -e aggr1_iops_4k -C "aggregate:total_transfers 4000 per_sec upper"
C:\ dfm perf threshold create -o 513 -d 300 -e aggr1_iops_4k -C "aggregate:total_transfers 4000 per_sec lower"


I found the object ID (513) for this aggr1 via the dfm aggr list command, and the aggregate:total_transfers label by hovering over the column heading in the OM web interface and reading the URL. (Did I mention I had a hard time finding Netapp docs on DFM/OM threshold creation?)
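The lookup and verification steps can be sketched as follows (dfm subcommand names per DFM 3.x/4.x; treat the exact output columns as an assumption):

```shell
# Find the DFM object ID of the aggregate (513 in this case)
dfm aggr list

# After creating the thresholds, confirm they were registered
dfm perf threshold list
```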
Now when you go to the OM web interface, Setup->Alarms, you will see new Event Names to trigger an alarm on (perf:aggr1_iops_4k:breached, perf:aggr1_iops_4k:normal).
I configured OM to send me an email if the IOPS breach the 4000 level (well in advance of the 6000 IOPS level at which this aggregate produces all the webapp, Oracle, and VM timeout issues).
Now I expect to never be surprised by this again ;)

Friday, September 10, 2010

Vib signature missing

While testing out ESXi 4.1 on a set of new Dell 610s in the lab, I had no problem installing the Dell OpenManage VIB for remote hardware monitoring. But when attempting to install the Myricom driver for the 10GigE PCIe card, I ran into this error:

Please wait patch installation is in progress ...
No matching bulletin or VIB was found in the metadata.No bulletins for this platform could be found. Nothing to do.


What this ended up indicating was that the Myricom driver was built for ESXi 4.0.* (as specified in the zip bundle's vmware.xml file) and did not match the ESXi 4.1 version, so the driver installation was aborted.

I contacted Myricom for an updated driver; they did not have one ready yet, but would I like to try a beta driver? Sure, this is the lab environment anyway, so I tried the beta driver and received this error:

/tmp # esxupdate update --bundle=/tmp/myri10ge.zip
Encountered error VibSigMissingError:
The error data is:
Vib - /var/tmp/cache/cross_vmware-esx-drivers-net-
myri10ge_400.1.5.2-1OEM.vib
Errno - 20
Description - Vib signature is missing.



This turned out to be an expected error: because the driver is a beta, it's not signed.
(I had also tried the vihostupdate.pl route to issue the update remotely, which failed with the same error.)
The solution in this case is to add the --nosigcheck flag to the command to disregard the signature check:

esxupdate update --bundle=/tmp/myri10ge.zip --nosigcheck
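The remote vihostupdate.pl route mentioned above accepts the same flag; a sketch with a hypothetical host name (flags per the vSphere CLI 4.x, so verify against your vCLI version):

```shell
# From a vSphere CLI workstation; host name and credentials are hypothetical.
# --nosigcheck skips the VIB signature check, like the local esxupdate flag.
vihostupdate.pl --server esxi-lab01 --username root \
    --install --bundle /tmp/myri10ge.zip --nosigcheck
```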

Up until now we have been all ESX, so besides breaking in the new hardware, this ESXi lab work is serving to help define new administrative workflows, expose issues, and develop solutions.
Since VMware announced that ESX ends with this 4.1 version, we are investing the lab time now and do not plan to deploy any new ESX 4.1 hosts; instead we will deploy ESXi 4.1.