Wednesday, February 16, 2011

Vfiler Non-disruptive Migration

We recently took delivery of upgrades for our aging NetApp 3040s (we ordered 3170s just before the 32xx came out - but that is another story!)
Using Data Motion for vFiler Migration (new in ONTAP 7.3.3 - see TR-3814), we were able to non-disruptively (with zero downtime) migrate all 25 TB of our storage services (15 vFilers) from the old hardware to the new hardware in less than one week!
What follows is the summary of our first use of this highly valuable new feature, including gotchas and bugs we encountered along the way.

As it happened, we were unaware of this new feature at first, so we planned to use our documented procedures built around the disruptive vfiler dr activate command line.
Since we run all our NFS exports (a mix of Oracle, VMware datastores, web content, video repositories, and app and log file shares) out of vFilers with existing DR SnapMirror relationships, the plan under ONTAP 7.3.3 was going to involve considerable disruption and many error-prone steps:

1) Shut down/pause NFS client access to the vFilers
2) Fail over to the DR vFilers (vfiler dr activate)
3) Resume clients on the DR vFilers
4) Upgrade the 3040 heads to 3170s - test
5) Re-establish DR vFiler SnapMirrors in the reverse direction (3040 -> 3170)
6) Fail back to the new 3170s (vfiler dr activate)
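
For the record, the disruptive plan above boils down to a handful of 7-Mode vfiler dr commands. A rough sketch, assuming hypothetical vFiler and filer names (vf-web, prod-3040, dr-filer, new-3170) - exact direction and prompts will vary with your DR layout:

```
# Step 2: on the DR filer, break the mirrors and activate the DR copy
dr-filer> vfiler dr activate vf-web@prod-3040

# Step 5: after the head upgrade, resync the relationship in the
# reverse direction so the upgraded head has a current copy
new-3170> vfiler dr resync vf-web@dr-filer

# Step 6: fail back by activating on the upgraded production head
new-3170> vfiler dr activate vf-web@dr-filer
```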

We began testing to work out any issues with the steps. Immediately we ran into a new duplicate IP issue we had not seen with previous versions of ONTAP. We opened a case with NetApp but were basically told that the duplicate IP was expected behavior for failover on the same subnet.

This forced us to start looking at other options. When we upgraded our Data Fabric Manager (DFM), also known as Operations Manager, and downloaded the new NetApp Management Console (NMC), also known as Provisioning Manager (PM) ;) - we noticed the new (for ONTAP 7.3.3) option to non-disruptively migrate vFilers from one cluster to another.

This tested fine, except once when it brought down one head (the head dropped off the net) due to IO starvation - BUG 90134 ("Heavy file deletion loads can make a filer less responsive"). This was resolved via another NetApp support case, in which we re-prioritized volume deletion operations so they would not swamp the head during the cleanup phase of the vFiler migration:

options wafl.trunc.throttle.system.max 30
options wafl.trunc.throttle.vol.max 30
options wafl.trunc.throttle.min.size 1530
options wafl.trunc.throttle.hipri.enable off
options wafl.trunc.throttle.slice.enable off
Once these settings were in place, we had no further issues with vFiler migrations disrupting production.

We proceeded cautiously to migrate our first vFiler (the least critical ISO / file repository vFiler).
The clients did not notice or log any issues, so things were looking good on the non-disruptive front. However, we were less than excited to find that vFiler migration had no way to import our existing multi-terabyte SnapMirror relationships. To use PM's vFiler migration, we had to delete these SnapMirror relationships with the DR vFiler and recreate them, re-initializing from scratch (sometimes taking 18 hours).
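
Tearing down and rebuilding one of those mirrors looked roughly like the following (a sketch only - the volume and filer names here are hypothetical):

```
# On the DR filer: quiesce and break the existing mirror
dr-filer> snapmirror quiesce dr-filer:vol_iso_m
dr-filer> snapmirror break dr-filer:vol_iso_m

# On the old source: release the relationship so its metadata is cleaned up
old-3040> snapmirror release vol_iso dr-filer:vol_iso_m

# On the DR filer: re-initialize from the vFiler's new home (the 18-hour part)
dr-filer> snapmirror initialize -S new-3170:vol_iso dr-filer:vol_iso_m
```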

Another issue was that the PM vFiler migrate "wizard" did not allow us to specify the destination aggregate - however, a quick post to the community forums revealed that the latest NMC 3.0D1 (released that week!) had added the option to select the destination aggregate. (Note: I found the responsiveness in the Storage Management Software forum to be amazing - on one occasion getting 3 solutions in one day!)

One other issue was vFiler migration failing due to unlicensed features (e.g. CIFS) on the destination.
We solved this with:

vfiler disallow vfiler-vf-02 proto=cifs
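
The disallow command simply strips the unlicensed protocol from the vFiler's allowed list so the migration precheck passes; it can be restored later. A short sketch using the same vFiler name:

```
# Verify which protocols the vFiler now allows
filer> vfiler status -a vfiler-vf-02

# Re-enable CIFS once the destination head is licensed for it
filer> vfiler allow vfiler-vf-02 proto=cifs
```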

We had successfully migrated VMware NFS datastore vFilers and Oracle-on-NFS vFilers without a single issue logged on the client side, but a couple of problematic vFilers kept erroring out for unspecified reasons, preventing us from getting off the old 3040 hardware. For these, we found that running the same vFiler migrate from the command line proceeded without error.

vfiler migrate usage:

vfiler migrate [-l remote_login:remote_passwd] [-m method] [-c secure]
[-e interfacename:ipaddr:netmask,...] remote_vfiler@remote_filer
vfiler migrate start [-l remote_login:remote_passwd] [-c secure]
[-e interfacename:ipaddr:netmask,...] remote_vfiler@remote_filer
vfiler migrate status remote_vfiler@remote_filer
vfiler migrate cancel [-c secure] remote_vfiler@remote_filer
vfiler migrate complete [-l remote_login:remote_passwd] [-c secure] remote_vfiler@remote_filer
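
In practice, a command-line migration of one of the stubborn vFilers went roughly like this (a sketch - the login, vFiler, and filer names are hypothetical):

```
# From the destination (3170) head: start the baseline transfer
new-3170> vfiler migrate start -l root:xxxx vf-app@old-3040

# Poll until the underlying mirrors are caught up
new-3170> vfiler migrate status vf-app@old-3040

# Cut over: final update, then the vFiler comes up on the new head
new-3170> vfiler migrate complete -l root:xxxx vf-app@old-3040
```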

Tracking Progress:
It was interesting to watch the migration monitor to get a glimpse into the internal operations on the backend that make the migration non-disruptive.

In conclusion, Data Motion for vFiler Migration allowed us to upgrade from the old 3040 hardware to the new 3170 cluster with ZERO downtime. Once the initial issues were resolved, the migrations were robust: either they succeeded, or they failed without affecting production. In conjunction with VMware vMotion and Storage vMotion, NetApp's non-disruptive vFiler migrate also provides increased operational agility and efficiency (e.g. we migrated a vFiler from one head to the other within the cluster to balance load). I recently took the new NetApp Operations Manager (OPSMGR) class to learn more about the whole context of Provisioning and Protection Manager and how it fits with the new ONTAP 8 directions - look for my review of the class in an upcoming post!

1 comment:

Toni Määttä said...

Nice post. Good to know that this works as well. I guess you had new disk shelves on the 3170 as well? Or did you take shelves from the old box to the new box while migrating? Did you already have vFilers in use on the old box, or is it possible to create those on the fly and move volumes from the "root" vFiler to other vFilers?

We just did an (ongoing) 3040 -> 3240 upgrade by creating a "twin" MetroCluster and just using Storage vMotion to migrate servers. A bit unsupported, but it works ;)

Thanks :)