Sunday, September 19, 2010

NetApp IOPS threshold alerting

This post could be considered part 2 of the Deconstructing IOPS post, where we picked apart the total number of I/O operations into their constituent sources to determine which source might be consuming more than expected. In that case it turned out we had stacked SnapMirror operations on top of each other (inadvertently, over time, by adding more and more volumes and SnapMirror schedules with the same aggressive SLA).
We squashed that one, rid ourselves of the VM-killing latency spikes, and returned our disk busy levels from critical (80%) to normal (40%) as measured by NetApp's Performance Advisor.
But about two weeks ago we started seeing new pages to sysadmins during our backup window (Friday evening through Saturday): web apps timing out and losing DB connectivity (and needing to be restarted), virtual machines responding slowly, and Oracle reporting many long-running queries that usually run quickly. So initially we focused on the backups - surely they must be killing our performance - but no, nothing was new there, network- or storage-wise. Then I logged back in to Performance Advisor and was stunned to see the IOPS on our busy AGGR1 back up to critical levels (6000+). Was it the snapmirrors again? No - a quick check of the schedule and status showed they were still on their idle every-other-hour schedules.
Time to repeat the IOPS deconstruction exercise: what are the sources of these 6000 IOPS, and what is this very regular 5 minute spike pattern?

Looking at each volume's I/O graph in PA, one quickly stood out as the most likely source, with the same 5 minute spike pattern: ora64net, which is our Oracle RAC storage.

From the long-running queries list I was able to work with the DBA to determine that the source was a new monitoring system we are testing (Zabbix), and specifically the orabbix database check.
I then worked with the sysadmin to determine which of the 55 orabbix database checks was causing the I/O spikes. He disabled them in turn until we discovered the audit log query was responsible (the query was operating on a million-row unindexed table that was constantly growing from all of our monitoring logins!). Truncating the table to something manageable was one solution; disabling the audit check was the immediate relief:
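For anyone in the same spot, a sketch of the truncation fix, assuming the check hits Oracle's default audit trail table SYS.AUD$ (the post doesn't name the table, so treat both the table and the session below as illustrative, and archive any rows you need to keep first):

```
C:\> sqlplus / as sysdba
SQL> SELECT COUNT(*) FROM sys.aud$;   -- confirm it really is the million-row offender
SQL> TRUNCATE TABLE sys.aud$;         -- immediate, non-recoverable purge
```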

But how to avoid being surprised like this in the future?
NetApp Operations Manager custom thresholds and alarms!
Operations Manager not only tracks the metrics presented in Performance Advisor, it also provides a facility to configure custom thresholds for use with alarms - but you need to drop to the DFM command line to configure these:
C:\ dfm perf threshold create -o 513 -d 300 -e aggr1_iops_4k -C "aggregate:total_transfers 4000 per_sec upper"
C:\ dfm perf threshold create -o 513 -d 300 -e aggr1_iops_4k -C "aggregate:total_transfers 4000 per_sec lower"
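Two commands: one for the upper (breach) threshold and one for the lower (return to normal). To sanity-check afterwards, something along these lines should work - a sketch only: `dfm event list` is standard DFM CLI, but I'm assuming `dfm perf threshold list` follows the usual create/list subcommand pattern, so verify it on your DFM version:

```
C:\> dfm perf threshold list
C:\> dfm event list
```

Once the threshold is crossed, the breached/normal events should show up in the event list alongside the built-in ones.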

I found the object ID (513) for aggr1 via the dfm aggr list command, and the aggregate:total_transfers counter name by hovering over the column heading in the OM web interface and looking at the URL. (Did I mention I had a hard time finding NetApp docs on DFM/OM threshold creation?)
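For illustration, the lookup went something like this (output abbreviated and hypothetical - the filer name and column layout are made up; only the 513 / aggr1 pairing is from my setup):

```
C:\> dfm aggr list
ID    Aggregate    Storage System
513   aggr1        filer1
...
```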
Now when you go to the OM web interface, Setup -> Alarms, you will see new event names to trigger an alarm from (perf:aggr1_iops_4k:breached and perf:aggr1_iops_4k:normal).
I configured OM to send me an email if the IOPS breach the 4000 level - well in advance of the 6000 IOPS level at which, for this aggregate, all the webapp, Oracle, and VM timeout issues appear.
Now I expect to never be surprised by this again ;)
