Wednesday, February 16, 2011

Vfiler Non-disruptive Migration

We recently took delivery of upgrades for our aging NetApp 3040s (we ordered 3170s just before the 32xx series came out - but that is another story!)
Using Data Motion for vFiler Migration (new in ONTAP 7.3.3 - see TR-3814), we were able to non-disruptively (with zero downtime) migrate all 25 TB of our storage services (15 vFilers) from the old hardware to the new in less than one week!
What follows is the summary of our first use of this highly valuable new feature, including gotchas and bugs we encountered along the way.

As it happened, we were unaware of this new feature at first, so we planned to use our documented procedures built around the disruptive vfiler dr activate command.
Since we run all our NFS exports (a mix of Oracle, VMware datastores, web content, video repositories, and app and log file shares) out of vFilers with existing DR snapmirror relationships, the plan under ONTAP 7.3.3 was going to involve considerable disruption and many error-prone steps:

1) Shut down/pause NFS client access to the vFilers
2) Fail over to the DR vFilers (dr activate)
3) Resume clients on the DR vFilers
4) Upgrade the 3040 heads to 3170s and test
5) Re-establish the DR vFiler snapmirrors in the reverse direction (3040->3170)
6) Fail back to the new 3170s (dr activate)
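In 7-mode CLI terms, the failover/failback steps would have looked roughly like this (a sketch only - the filer and vFiler names are illustrative, and the exact per-volume snapmirror commands are omitted):

```
dr-3040> vfiler dr activate vf-web@prod-3040     # step 2: break the mirrors, bring up the DR copy
  ...clients resume against dr-3040 while the production heads are upgraded...
dr-3040> snapmirror resync ...                   # step 5: per-volume, in the reverse direction
new-3170> vfiler dr activate vf-web@dr-3040      # step 6: fail back to the new heads
```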

We began testing to work out any issues with these steps. Immediately we ran into a duplicate IP issue we had not seen with previous versions of ONTAP. We opened a case with NetApp but were basically told that the duplicate IP was expected behavior for failover on the same subnet.

This forced us to start looking at other options. When we upgraded our Data Fabric Manager (DFM), also known as Operations Manager, and downloaded the new NetApp Management Console (NMC), also known as Provisioning Manager (PM) ;) - we noticed the new (for ONTAP 7.3.3) option to non-disruptively migrate vFilers from one cluster to another.

This tested fine, except for one run that brought down a head due to IO starvation - BUG 90134 (Heavy file deletion loads can make a filer less responsive) - causing the head to drop off the net. This was resolved via another NetApp support case: we re-prioritized volume deletion operations so they could not swamp the head during the cleanup phase of the vFiler migration:

options wafl.trunc.throttle.system.max 30
options wafl.trunc.throttle.vol.max 30
options wafl.trunc.throttle.min.size 1530
options wafl.trunc.throttle.hipri.enable off
options wafl.trunc.throttle.slice.enable off
Once these settings were in place, we had no further issues with vFiler migrations disrupting production.

We proceeded cautiously, migrating our least critical vFiler first (an ISO/file repository).
The clients did not notice or log any issues, so things were looking good on the non-disruptive front. However, we were less than excited to discover that vFiler migration had no way to import our existing multi-terabyte snapmirror relationships. To use PM's vFiler migration, we had to delete the snapmirror relationships with the DR vFiler and recreate them afterwards, re-initializing from scratch (sometimes an 18-hour transfer).

Another issue was that the PM vFiler migrate "wizard" did not allow us to specify the destination aggregate. However, a quick post to the community forums revealed that the latest NMC 3.0D1 (released that week!) had added the option to select the destination aggregate. (Note: I found the responsiveness in the Storage Management Software forum to be amazing - on one occasion getting 3 solutions in one day!)

One other issue was vFiler migration failing due to unlicensed features (e.g. CIFS) on the destination.
We solved this with:

vfiler disallow vfiler-vf-02 proto=cifs
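More generally, checking protocol licensing parity before each migration avoids this failure mode; a hedged sketch (7-mode commands, vFiler name from our case, filer prompts illustrative):

```
new-3170> license                        # confirm which protocols the destination is missing
old-3040> vfiler status -a               # show each vFiler's allowed/disallowed protocols
old-3040> vfiler disallow vfiler-vf-02 proto=cifs   # drop the unlicensed protocol pre-migration
```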

We had successfully migrated the VMware NFS datastore vFilers and the Oracle-on-NFS vFilers without a single issue logged on the client side, but a couple of problematic vFilers kept erroring out for unspecified reasons, preventing us from getting off the old 3040 hardware. For these, we found that running the same vFiler migrate from the command line proceeded without error.

vfiler migrate usage:

vfiler migrate [-l remote_login:remote_passwd] [-m method][-c secure]
[-e interfacename:ipaddr:netmask,...] remote_vfiler@remote_filer
vfiler migrate start [-l remote_login:remote_passwd] [-c secure]
[-e interfacename:ipaddr:netmask,...] remote_vfiler@remote_filer
vfiler migrate status remote_vfiler@remote_filer
vfiler migrate cancel [-c secure] remote_vfiler@remote_filer
vfiler migrate complete [-l remote_login:remote_passwd] [-c secure] remote_vfiler@remote_filer
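For those stubborn vFilers, the command-line run followed the usage above; roughly (vFiler, filer names, and credentials are illustrative):

```
new-3170> vfiler migrate start -l root:xxxx vf-oracle@old-3040      # begin the baseline transfer
new-3170> vfiler migrate status vf-oracle@old-3040                  # poll until ready to cut over
new-3170> vfiler migrate complete -l root:xxxx vf-oracle@old-3040   # final cutover
```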
Tracking Progress:
It was interesting to watch the migration monitor for a glimpse of the internal operations on the backend that make the migration non-disruptive.

In conclusion, Data Motion for vFiler Migration allowed us to upgrade from the old 3040 hardware to the new 3170 cluster with ZERO downtime. Once the initial issues were resolved, the migrations were robust: either they succeeded, or they failed without affecting production. In conjunction with VMware vMotion and Storage vMotion, NetApp's non-disruptive vFiler migrate also provides increased operational agility and efficiency (e.g. we migrated a vFiler from one head to the other within the cluster to balance load). I recently took the new NetApp Operations Manager (OPSMGR) class to learn more about the whole context of Provisioning and Protection Manager and how it fits with the new ONTAP 8 directions - look for my review of the class in an upcoming post!

Thursday, February 10, 2011

Apache MaxMemFree For Lean Memory Control

We were experiencing unexplained memory spikes on our CentOS Apache VMs.
This was a big problem: within the span of 5-10 minutes an Apache webserver VM would go from 23% memory usage to swapping out to the NFS datastore, seriously impacting performance until we restarted Apache to clear the condition.

This was a non-trivial (interesting!) problem to analyze, for several reasons:
1) The spikes were not easily tied to any particular increase in the number of requests (in fact the scoreboard showed most threads idle during the memory spikes)

2) Determining which Apache memory components contributed to the spikes was not easy:
2a) top reports memory usage including shared pages, not the REAL memory bound exclusively to each httpd process
2b) we did not know which Apache requests or which Apache processes were consuming the memory (out of dozens of httpd processes and thousands of requests)
3) The restart fix was easy enough that root cause analysis kept being deferred for other priorities.

Road to the Solution (skip to bottom for the Solution ;):

First we needed a way to tie the httpd PIDs to the requests they were serving.
Our existing LogFormat did not include the PID of the httpd serving the request;
adding %P to the end solved this:

LogFormat "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D %P" combined
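With %P in place, the PID becomes the last field of every access-log line, so per-PID request counts fall out of a one-line awk. A minimal sketch (the sample log lines below are synthetic, not from our servers):

```shell
# Build a tiny synthetic access log in the combined-with-%P format
cat > /tmp/access_log.sample <<'EOF'
10.0.0.1 web08 - [07/Feb/2011:10:00:01 -0500] "GET /a.mp4 HTTP/1.1" 200 1024 "-" "curl" 1200 23691
10.0.0.2 web08 - [07/Feb/2011:10:00:02 -0500] "GET /b.iso HTTP/1.1" 200 2048 "-" "curl" 900 24871
10.0.0.1 web08 - [07/Feb/2011:10:00:03 -0500] "GET /a.mp4 HTTP/1.1" 200 1024 "-" "curl" 1100 23691
EOF
# Count requests per serving PID ($NF is %P, the last field)
awk '{count[$NF]++} END {for (pid in count) print pid, count[pid]}' /tmp/access_log.sample | sort
# prints:
# 23691 2
# 24871 1
```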

Next, top was not going to give us the REAL memory consumption of these bloated httpds. It turns out that on Linux you need to dig into the /proc/*/smaps files and total the Private_Dirty entries (credit to Hongli Lai and his helpful post for the seed of this script):

[root@web08 02]# more ~/ShowMem.csh

#!/bin/csh
# For each process, sum its Private_Dirty kB and print: total command smaps_path
foreach i (`ls /proc/*/smaps`)
echo `grep Private_Dirty $i | awk '{ print $2 }' | xargs ruby -e 'puts ARGV.inject { |i, j| i.to_i + j.to_i }'` " " `head -1 $i | awk '{print $NF}'` " " $i
end

Running this through sort shows the highest REAL (private dedicated to that PID) memory:
[root@web08 02]# ~/ShowMem.csh | sort -nr | head

16576 /usr/sbin/httpd /proc/23691/smaps
15484 /usr/sbin/httpd /proc/24871/smaps
3432 /usr/sbin/httpd /proc/24734/smaps
3188 /usr/sbin/httpd /proc/25354/smaps
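The per-process Private_Dirty total can also come from a single awk with no ruby dependency; a sketch, demonstrated on a synthetic smaps excerpt since live /proc contents vary from box to box:

```shell
# Synthetic two-mapping smaps excerpt (real files carry many more fields)
cat > /tmp/smaps.sample <<'EOF'
00400000-004f0000 r-xp 00000000 08:01 1234 /usr/sbin/httpd
Size:              960 kB
Private_Dirty:      12 kB
7f00aa000000-7f00aa100000 rw-p 00000000 00:00 0
Size:             1024 kB
Private_Dirty:     500 kB
EOF
# Sum every Private_Dirty line (values are in kB)
awk '/^Private_Dirty:/ {sum += $2} END {print sum}' /tmp/smaps.sample
# prints: 512
```

On a live box the same idea, looped over /proc/[0-9]*/smaps and piped through sort -nr, reproduces the ShowMem.csh listing.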
I then had the PID(s) of the most bloated httpds to search for in the Apache access logs.
I chose to focus on the LARGEST payload requests for these PIDs first.
Sort the access log by the second-to-last field (%D, the time taken to serve the request - a rough proxy for payload size):

awk '{print $(NF-1) " " $1 " " $2 " " $3 " " $4 " " $5 " " $7}' access_log.20110207 | sort -nr | more
Starting from a fresh Apache restart, so there were no bloated httpds yet, I tested several of the high-payload URLs from the access log while watching repeated runs of ShowMem.csh to catch any httpds growing.
Surprisingly, the httpds did not grow when serving 300 MB+ mp4/mov files straight from the filesystem, but they did when the SAME FILE was served via mod_jk from the app layer!
(A quick check confirmed mod_jk was up to date, with no memory fixes in newer versions.)
I could not yet explain why this webapp/mod_jk combination caused Apache to hold onto the memory in its anonymous smaps space, but now I could readily reproduce and observe the issue at will (and that is 99% of the battle).

Armed with this info, I started researching for apache memory directives and quickly found

MaxMemFree !!

After adding

MaxMemFree 10000

I repeated the test and did not see the desired effect advertised by the documentation.
Then I read in the forums that the units may be MB rather than KB as documented.
I then tried:

MaxMemFree 10

Repeating my test, I watched the httpd serving the 300 MB mp4 video file via mod_jk balloon to over 200 MB while serving the request, then quickly free the memory and return to 2.3 MB!
Success (MaxMemFree FTW!)
Our Apache instances now run much leaner; effectively we've increased our capacity and eliminated our exposure to random requests bloating our httpd memory consumption.
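For reference, the fix came down to a single directive in httpd.conf (a sketch; as noted above, the effective units in our testing appeared to be MB despite the documentation saying KB):

```apache
# Cap the unused memory each child process may hold onto between requests;
# anything above this threshold is handed back to the allocator.
MaxMemFree 10
```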