Friday, September 23, 2011

ESXi5 Myricom 10g driver install

We are upgrading ESXi 4.1 hosts to ESXi 5 - some of which have Myricom 10g network cards installed.
Once the host is upgraded to ESXi 5 we need to install the new driver:
1) scp net-myri10ge-1.5.3-1OEM.500.0.0.406165.x86_64.vib root@esxi5-02:/var/log/vmware
2) ssh esxi5-02 as root
3) /var/log/vmware # esxcli software vib install -v /var/log/vmware/net-myri10ge-1.5.3-1OEM.500.0.0.406165.x86_64.vib
Installation Result
Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
Reboot Required: true
VIBs Installed: Myricom_bootbank_net-myri10ge_1.5.3-1OEM.500.0.0.406165
VIBs Removed:
VIBs Skipped:
4) reboot and verify the card is discovered (vCenter Config->Network->Network Adapters->Add)
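A couple of quick post-reboot sanity checks can also be done from the ESXi shell (a sketch - the VIB name is from the install output above; your vmnic numbering will differ):

```shell
esxcli software vib list | grep -i myri10ge   # confirm the driver VIB is registered
esxcli network nic list                       # confirm the 10g card's vmnicX appears
```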

There is also a new method to import these as "patch extensions" into vCenter, then attach them like baselines to hosts. My first pass at this went well, except the remediation step reported the patch was not applicable and it was skipped!

Monday, September 12, 2011

VUM upgrade SQL DB connection fails

I've upgraded two vCenter 4.x's to vCenter 5 now and ran into this both times.

While upgrading the Update Manager the error returned is (paraphrasing):

"Can not connect to the database, please verify username and password..."

If your vCenter has been through a series of upgrades (like mine), the installer may be referring to a DSN (32-bit vs 64-bit) that you have not updated with the correct credentials.

To update the DSN the installer is referring to, you need to run the 32-bit ODBC tool which is located at C:\Windows\SysWOW64\odbcad32.exe. Do NOT use the default odbcad32.exe located in the C:\Windows\System32 folder. While it has the same file name, they are two different files!!

Also, if you get an error generating the JRE SSL keys, re-running that step should succeed.

Again, check your DSN with the C:\Windows\SysWOW64\odbcad32.exe, not the default path odbcad32.exe

Sunday, September 11, 2011

VM start error Cannot open the disk

We ran into this issue when a VM went down unexpectedly and it would not restart cleanly - the error was:

"Cannot open the disk '/vmfs/volumes/4cf0daa-10489d5-4e6b-00219ba51425/ttw-db-09/ttw-db-09_3.vmdk' or one of the snapshot disks it depends on."

To debug, I ssh'ed into the ESXi 4.1 box to check the file it was referring to (it was fine - readable and touchable per KB10051) and the vmware.log, which indicated the actual issue:

"Cannot open the disk '/vmfs/volumes/4cf0daa-10489d5-4e6b-00219ba51425/ttw-db-09/ttw-db-09_3.vmdk' Failed to lock the file"

None of the related KBs helped resolve the issue.
The solution came from a forum posting: simply creating a temporary snapshot and then deleting it cleared the condition and allowed the VM to power on cleanly.
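For reference, the forum workaround can be scripted from the ESXi shell with vim-cmd (a sketch; the VM id 42 is a placeholder - look yours up with getallvms first):

```shell
vim-cmd vmsvc/getallvms                     # note the Vmid of the affected VM
vim-cmd vmsvc/snapshot.create 42 tmplock    # create a temporary snapshot
vim-cmd vmsvc/snapshot.removeall 42         # delete it - this clears the stale lock
vim-cmd vmsvc/power.on 42                   # VM should now power on cleanly
```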

thanks again to the vmware community forums - truly a great 24x7 resource!

Wednesday, August 17, 2011

Quantifying Spindle:VM throughput relationship

Last week we took delivery of an additional DS4243 disk shelf (24 x 300Gb 15K RPM disks).
This morning we connected it non-disruptively to our Netapp 3270.
These 24 disks were slated to be assigned to our existing aggregate consisting of 2 x DS4243 shelves, effectively becoming one third of the spindles (IOPs and storage) of the newly expanded aggregate.
Before adding the disks to expand the aggregate we wanted to take a benchmark from a VM's perspective BEFORE and then compare it to the same VM's performance AFTER the 1/3 IOPs upgrade.


Netapp 3270 running Ontap
10Gb connection to ESXi 4.1 host
HD Tune is used for disk IO benchmark

BEFORE (46 disk Aggregate):

Read transfer rate
Transfer Rate Minimum : 21.5 MB/s
Transfer Rate Maximum : 95.9 MB/s

Transfer Rate Average : 65.8 MB/s
Access Time : 10.3 ms

AFTER (67 Disk Aggregate):

Read transfer rate
Transfer Rate Minimum : 0.6 MB/s
Transfer Rate Maximum : 96.7 MB/s
Transfer Rate Average : 82.9 MB/s
Access Time : 6.67 ms


Throughput: Avg transfer rate went from 65.8 to 82.9 MB/s (26% better)
Latency: Access Time went from 10.3 to 6.67ms (35% improvement)
You can also clearly see the deviation from the average is much improved: the second throughput graph shows transfer rates staying in a much tighter 80-90 MB/sec range than the first, smaller aggregate. This translates into a more deterministic performance profile for our VI - the larger aggregate can "soak up" short IOP demand spikes that would have slowed down the smaller one.
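These before/after deltas can be recomputed from the raw HD Tune numbers with a quick awk one-liner:

```shell
awk 'BEGIN {
  printf "throughput: +%.1f%%\n", (82.9 - 65.8) / 65.8 * 100
  printf "latency: -%.1f%%\n", (10.3 - 6.67) / 10.3 * 100
}'
```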

Note: it was necessary to clone the "Before" VM to force WAFL to stripe out the VM data onto the newly added disks (Will WAFL do this automatically over time for existing data? Or will only NEW VMs realize the full 67 spindle benefits?)

Monday, August 15, 2011

vCenter SQL Server Scheduled Job Errors

Since migrating to SQL 2008, our app event log was showing errors of the type:

SQL Server Scheduled Job 'Past Day stats rollupVIM_UMDB' (0x2838E6A98D1EBE4CB211E1768836BA68) - Status: Failed - Invoked on: 2011-08-14 17:00:00 - Message: The job failed. Unable to determine if the owner (DOM\dom.acct) of job Past Day stats rollupVIM_UMDB has server access (reason: Could not obtain information about Windows NT group/user 'DOM\dom.acct', error code 0x5. [SQLSTATE 42000] (Error 15404)).

All other vCenter operations seemed to be fine, but these jobs were failing and logging errors every 5-10 minutes.
Turns out the solution is documented in KB15404

The fix is to change the owner of these jobs from the domain account to SA:
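Per the KB, the ownership change can be scripted with sqlcmd (a hedged sketch - the job name is taken from the error above; repeat for each failing rollup job, and adjust the -S instance name for your setup):

```shell
sqlcmd -S localhost -E -Q "EXEC msdb.dbo.sp_update_job @job_name = N'Past Day stats rollupVIM_UMDB', @owner_login_name = N'sa'"
```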

Friday, August 5, 2011

False management network redundancy alert

We upgraded our DR cluster hosts recently and ran into this where vCenter was reporting:

HostXYZ "currently has no management network redundancy"

But wait - that's not true; the vSwitch clearly has two active connections!
I tried reconfiguring the ordering of the nics in the teaming config, but the warning persisted.

Then I found KB1004700, which starts by talking about ignoring/suppressing the warning (ignoring alerts is very bad practice, for what should be obvious reasons).
But I kept reading and saw:

"Note: If the warning continues to appear, disable and re-enable VMware High Availability in the cluster."

So I disabled and enabled HA and this cleared the alert - no suppression of alerts needed!

Wednesday, July 20, 2011

Growing the Cloud: P2V, P2E, V2E

Over time, our Modus Operandi for virtualizing has evolved.
When we started over 5 years ago, we had a small ESX 3.x cluster and did many P2V (Physical to Virtual) conversions.
Over the next 4 years we virtualized 20-25% per year with P2V, and today we sit at about the 90% virtualization level in our datacenters, with 21 ESXi hosts backed by 20Tb of storage, all on a 10Gb network.
Now the frequency of P2Vs has decreased, as we persuade customers to go directly virtual with their new projects by buying a piece of VI (Virtual Infrastructure) - essentially "growing the cloud".
Recently we had a customer who had already purchased their hardware, and it had substantial warranty support remaining - so we decided to:

P2V the existing webapp to the VI
P2E (Physical to ESXi - install ESXi over the old OS)
V2E (Virtual back to new ESXi - storage vMotion)

The only interruption was 10 minutes while the webapp was suspended to do a clean P2V.

The customer was happy to have the virtual space to create dev instance VMs and we were happy to continue to add capacity to our VI (and grateful for the flexibility virtualization provides to accomplish these kinds of migrations)

Friday, July 1, 2011

Welcome to the VMware vExpert 2011 Program

This morning I received an email "Welcome to the VMware vExpert 2011 Program"!
I am very honored to be in the same company as many of the great vExperts I follow.

"We're pleased to designate you as a vExpert 2011 as recognition of your contributions to the VMware, virtualization, and cloud computing communities. You’ve done work above and beyond, and we’re delighted to communicate more closely, to share resources, and to offer other opportunities for greater interaction throughout the year as we continue to grow knowledge and success in the community of IT professionals. Welcome to the vExpert 2011 Program!"

What a great way to start the long weekend!

Tuesday, June 7, 2011

powershell restart windows service

File this one under windows automation.

We are transitioning to a new Zabbix monitoring installation.
We discovered after a network outage that several Windows SNMP services needed to be restarted to restore monitoring.

Here is the powershell pipeline I used to do this:

(there is no restart method so we call stop, then start)
Assumes TCP port 135 is open to the servers listed in snmprestartlist.txt

[vSphere PowerCLI] C:\> get-content c:\snmprestartlist.txt | foreach-object {Get-Service -ComputerName $_ | Where-Object { $_.Name -eq "SNMP" } | ForEach-Object {$_.Stop() }}

[vSphere PowerCLI] C:\> get-content c:\snmprestartlist.txt | foreach-object {Get-Service -ComputerName $_ | Where-Object { $_.Name -eq "SNMP" } | ForEach-Object {$_.Start() }}

Check the status is now running:
[vSphere PowerCLI] C:\> get-content c:\snmprestartlist.txt | foreach-object {Get-Service -ComputerName $_ | Where-Object { $_.Name -eq "SNMP" } | Format-List *}

Friday, May 27, 2011

Extending Windows Drive Live

I've been using Dell's excellent ExtPart.exe for resizing Windows 2003 drives live.
But I found recently this utility does not work on Windows 2008 64 bit.

So the updated procedure for windows 2008 is:

1) use vCenter to grow the Windows 2008 drive
2) run Windows 2008 Server Manager->Storage->Disk Management->Action->Rescan Disks to have the added space from step one recognized
3) run diskpart.exe from the command line, list volume, select volume X, extend, list volume to verify space is added (live)!
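The diskpart session from step 3 looks roughly like this (volume number 2 is a placeholder - confirm it against your list volume output first):

```shell
diskpart
DISKPART> list volume
DISKPART> select volume 2
DISKPART> extend
DISKPART> list volume
```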


Sunday, April 24, 2011

shrinking snapmirror destination volumes

When the initial snapmirror relationship is setup, we typically have source and destination volumes of equal size. But when the source volume is shrunk:

vol size appdata -200g

the destination remains at its original size. The problem is we now have unclaimed space that is not reported by the usual tools (df, system manager etc) - the tools report what the source volume reports.

Over time you can end up with snapmirror destination volumes that are too large, and the space allocated during the initial snapmirror is effectively unavailable for new snapmirror destination volumes (or any other use) until you free it by shrinking the snapmirror destination volume (the example volume is "vol6" below):

0) df -A (check initial Aggregate usage level)
aggr1 14036753204 12158892092 1877861112 87%
1) snapmirror break vol6
2) vol options vol6 fs_size_fixed off
3) df -A (verify the space is returned to the aggr for reuse)
aggr1 14036753204 11812443012 2224310192 84%

But how do you get ONTAP to report the unused space without going through the procedure above?

One method is

vol status -b

this will report the volume size and filesystem size in 4k blocks - I added some awk to show the difference in MB:

[root@backup1 ~]# ssh netapp-01 -l root vol status -b | awk '{print $0" diff = "($3-$4)*4096/1024/1024" mb"}'
Volume Block Size (bytes) Vol Size (blocks) FS Size (blocks)
------ ------------------ ------------------ ----------------
vol0 4096 25952256 25952256
vm6 4096 684510413 590558004 diff = 367002 mb
sg1 4096 8912896 8650752 diff = 1024 mb
web2 4096 14417920 13107200 diff = 5120 mb
backup1 4096 5242880 3932160 diff = 5120 mb
data1 4096 15204352 14417920 diff = 3072 mb
vm2 4096 667942912 563714458 diff = 407142 mb
archive1 4096 418906112 402653184 diff = 63488 mb
ora6 4096 275251200 251658240 diff = 92160 mb
fcapdata 4096 18350080 14417920 diff = 15360 mb
apdata 4096 367001600 348966093 diff = 70451.2 mb
ora4 4096 216006656 183500800 diff = 126976 mb
backup2 4096 443862221 322122548 diff = 475546 mb

Now this takes the guesswork out of where the space is overallocated, and I can use the diff numbers to shrink the snapmirror destination volumes.
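To sanity-check the arithmetic, here is the same awk run against a single captured line of vol status -b output (4096-byte blocks, so block-diff x 4096 / 1024 / 1024 comes out in MB):

```shell
echo "sg1 4096 8912896 8650752" | awk '{print $1" diff = "($3-$4)*4096/1024/1024" mb"}'
```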

Update 3/16/15

snapmirror break vm6
vol options vm6 fs_size_fixed off # This returns the space to the aggregate
snapmirror resync vm65net

Wednesday, April 20, 2011

CPU resource shares bug

Folks in the VMware forums have long noticed this warning when attempting to move VMs into resource pools:

"The CPU resource shares for (vm you are trying to move) are much lower than the virtual machine's in the resource pool. With its current share setting, the virtual machine will be entitled to 2% of the CPU resources in the pool. Are you sure you want to do this?"

The workaround I found is to merely view (without changing anything) the Resources settings for the VM: Right Click VM->Edit Settings...->Resources Tab->OK

vCenter reports Reconfigure Virtual Machine (but we did not change anything!)

However, there are other variations of the message (eg "much higher", or memory instead of CPU resources) where this workaround does not resolve it - those cases are worth opening a support case for.

Friday, April 15, 2011

vMotion Unicast Flood ESXi

Our vmware mgmt and vmotion nics share the same IP space.
This is not VMware best practice - they recommend vmkernel/vMotion traffic be isolated in its own IP space.


While running ESX 4.1 we had no issues associated with the shared IP space, but once we started migrating to ESXi we noticed more and more network disruption during vMotion (especially bulk vMotions when migrating 10-20 VMs for ESXi maintenance). We noticed switches reporting high collisions and drops, and the Juniper firewall load would spike. Forum threads revealed others experiencing the same issue.


The issue is resolved by adding a new vMotion vNIC in a private IP space.
This is a best practice recommendation I previously believed would require a network topology design change with downtime.
But since the vMotion traffic does not route outside the cluster, VMware support was able to demonstrate it's as simple as adding a new vmkernel vNIC dedicated to vMotion (vmk3 below):

~ # esxcfg-route -l
VMkernel Routes:
Network          Netmask          Gateway          Interface
<mgmt subnet>    255.255.255.0    Local Subnet     vmk2
<mgmt subnet>    255.255.255.0    Local Subnet     vmk0
default          0.0.0.0          <gateway>        vmk0
Add a new vmkernel port (in VI client: Select your ESXi host->Configuration Tab->Networking->vSwitch Properties->Add->VMkernel):

Choose a private IP space - it does not even need to match the existing default gateway setting, since according to VMware support the vMotion traffic is not routed anyway. (I chose the convention of changing the first octet of our existing IPs to make them 10.x.y.z, and updated our networking records accordingly.)
~ # esxcfg-route -l
VMkernel Routes:
Network             Netmask          Gateway          Interface
<mgmt subnet>       255.255.255.0    Local Subnet     vmk2
<vMotion subnet>    255.255.255.0    Local Subnet     vmk3    <---- new vMotion private net
<mgmt subnet>       255.255.255.0    Local Subnet     vmk0
default             0.0.0.0          <gateway>        vmk0
~ #
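For reference, the same vmkernel port can be added from the CLI instead of the VI client (a sketch - the portgroup name "vMotion2", the vSwitch, and the 10.x address are placeholders; vMotion still needs to be enabled on the new vmk afterwards):

```shell
esxcfg-vswitch -A vMotion2 vSwitch0               # add a portgroup for the new vmk
esxcfg-vmknic -a -i 10.0.0.12 -n 255.255.255.0 vMotion2
esxcfg-route -l                                   # verify the new vmk appears
```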

Note: when you use the VI client to add a vMotion port, the previous vMotion port has its vMotion bit DISABLED (since only one vMotion port is allowed)

We made the change on 2 ESXi hosts, verified that bulk vMotions no longer caused the network disruptions, then rolled out the change to the rest of our cluster nodes (including the ESX hosts, for consistency).

This was a satisfying resolution all round in that our configuration was brought into line with best practices with zero downtime and the network disruption during vMotion issue was addressed at its root cause.

Wednesday, April 6, 2011

FusionIO ESXi PVSCSI VM Benchmarking

FusionIO recently (3/27/11) released their ESXi 4.1 drivers:

So, I took the opportunity to put the 600Gb ioDrive Duo through some VM benchmarks.
NOTE: If you get VIB Signature errors installing the driver like I did - see:

With the release of these drivers, there was finally native support for FusionIO datastores for ESXi.
(Previously folks were doing things like running Starwind to export the FusionIO over iSCSI)

Lab Config:
Dell 1950 Dual Quad Core Intel E5440 2.83GHz with 16Gb RAM
ESXi 4.1 U1

Benchmark Results:

HD Tune:

Default LSI SAS VM SCSI controller: 1222 MB/sec

PVSCSI VM SCSI driver: 1368MB/sec


Default LSI SAS VM SCSI Driver: 20721 IOPS @ 79% CPU

PVSCSI Driver: 21836 IOPS @ 33% CPU


These FusionIO throughput and IOPs numbers are around 4 times better than the Netapp 3040 40 disk aggr numbers obtained in previous benchmarks.

(1368-1222)/1222 = 11.9% better throughput with PVSCSI
IOMeter shows: 79 - 33 = 46 percentage points less CPU with PVSCSI

Thursday, March 31, 2011

Upgrade vCenter SQL 2005 Express to SQL 2008 Standard

vCenter comes with SQL Server 2005 Express - it works fine for smaller environments, but after a while your environment (# of ESX servers and # of VMs) will grow and you will exceed the 4Gb database size limit of this "free" SQL 2005 Express instance.

Once exceeded, your vCenter service will crash and fail on restart.

The error in your vCenter server vpxd log file will look like:

Error inserting events: "ODBC error: (42000) - [Microsoft][SQL Native Client][SQL Server]Could not allocate space for object 'dbo.VPX_EVENT_ARG'.'PK_VPX_EVENT_ARG' in database 'VIM_VCDB' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup." is returned when executing SQL statement "INSERT INTO VPX_EVENT_ARG

You can buy yourself some space by purging old records - but to address the root cause I chose to upgrade to a full SQL Server 2008.
Below are the steps to accomplish a vCenter SQL Server 2005 Express upgrade to SQL Server 2008 Standard (casting off the 4Gb limit). My vCenter 4.1 U1 runs as a VM in Windows 2008 64-bit Standard guest OS. Since the vCenter runs as a VM I forked off a clone to test the upgrade steps while not affecting the production vCenter. The upgrade path I used: SQL 2005 Express -> SQL 2008 Express -> SQL 2008 Standard:

SQL 2005 Express -> SQL 2008 Express:

1) Free up 5-10Gb on your vCenter server - windirstat is one of my favorite tools for identifying what is eating space on windows VMs
2) Take a snapshot of your vCenter VM "preSQL upgrade" (just in case)
3) download SQL 2008 Express:
4) run the SQLEXPRWT_x64_ENU.exe to upgrade to SQL 2008 Express
5) You will likely get an error towards the end regarding the SQL Server 2005 tools ("Please remove...").

To Re-run with success, you need to rename the registry entry:
HKLM\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\90\Tools out of the way (eg HKLM\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\90\Tools_old)
6) once the registry entry is renamed, click the re-run button to retry without restarting the upgrade

SQL 2008 Express -> SQL 2008 Standard:

1) Purchase a SQL Server 2008 Standard license to obtain a valid product code
2) Download the SQL Server 2008 Standard install
3) run the setup.exe with a special parameter "SKUUPGRADE=1" from the cmd prompt
4) Don't choose upgrade existing - choose "New install or add features to existing..." for the type of install.
5) Select all features when prompted
6) Supply the credentials for SQL services
7) This will run for a good 30+ minutes.
8) At the end your vCenter service will fail to start with the following in the log:

Failed to create https proxy: An attempt was made to access a socket in a way forbidden by its access permissions.

9) Turns out SQL 2008 brings a reporting service on port 80 - CONFLICTING with vCenter! Disable the reporting service.
10) Restart vCenter (I rebooted to ensure a clean startup)
11) Connect with the VI Client to verify all is good!
12) Once the vCenter functions are all verified, remember to delete the preSQL snapshot

While figuring out how to jump through these upgrade hoops, I was wishing for a MySQL option for vCenter - apparently in 2009 VMware attempted a beta of this, but it was abandoned - today the options are SQL Server or Oracle for the vCenter DB.

UPDATE 4/4/11:
I noticed the SQLEXP_VIM database was not migrated to the SQL Server 2008 instance.
I needed to detach it from the 2005 instance and re-attach it to the 2008 instance according to:

Once this was done, we were finally rid of the 4Gb limit on the vCenter SQL DB!

32 bit DSN required by Update Manager:
"The DSN, ‘VUM’ does not exist or is not a 32 bit system DSN. Update Manager requires a 32 bit system DSN."

Cause: Using the ODBC tool in the Control Panel will create a 64-bit DSN. You need to use the 32-bit ODBC tool which is located at C:\Windows\SysWOW64\odbcad32.exe. Do NOT use the odbcad32.exe located in the C:\Windows\System32 folder. While it has the same file name, they are two different files!! - I was running the wrong exe by default...whew!

fix domain errors with newsid.exe

Maybe you are a unix guy like me and you sometimes forget when cloning one-off temporary Windows VM's for testing to run sysprep or newsid.exe to avoid errors like:

The system cannot log you on due to the following error: The name or
security ID (SID) of the domain specified is inconsistent with the
trust information for that domain.
The officially supported method is of course to use sysprep (perfect for the cloning from a prepared template use case)

But for the other use cases (eg testing a P2V or an upgrade of VC VM etc) where official support is not as important (the cloned VM will be deleted once testing is done), newsid works just fine.

Microsoft documents link to sysprep, but here is the small utility downloadable for posterity


Again, this is unsupported by Microsoft - use sysprep for production use.

Wednesday, March 23, 2011

vmware-tools install error

While installing/updating vmware tools (VMwareTools-8.3.2-257589.tar.gz) for linux guests, I sometimes receive the error:

Unable to create symbolic link "/usr/lib/vmware-tools/bin"


Unable to create symbolic link "/usr/lib/vmware-tools/libconf"

Seems the installer has a bug where it fails to remove the existing directory before creating the symlink (it prompts to overwrite, but overwriting a directory with a symlink is not possible - therefore I'd call this a BUG).

The solution is to simply remove the existing dirs:

rm -rf /usr/lib/vmware-tools/libconf
rm -rf /usr/lib/vmware-tools/bin

and run the install again:

./ -d

Monday, March 14, 2011

Apache Optimal vCPU Analysis

Last week I posed a question in the VMware forums:

How to determine optimal vCPU for apache workloads given a specified hardware and software configuration?

I prefaced the question by stating we strictly adhere to the best practice of keeping the vCPU at 1 unless the workload is multithreaded and capable of benefitting from the additional vCPUs.
Given the highly multithreaded nature of apache we had set the vCPU to 8, but without any numbers to quantify that this was the optimal value, it was more of an intuitive configuration based on our workloads and knowledge of ESX.

Without any feedback in the week following the posting, I took it as an opportunity to design an experiment to measure the effect of varying the vCPU on apache throughput, latency, response time etc.

What follows are some unexpected observations that may or may not be useful to others looking at tuning vCPU for their environments.

Experiment design:
The goal of this experiment is to measure the effects of varying the vCPU setting of a CentOS 5 Apache webserver VM. For generating the web server load I chose apachebench.

cloned a production web server for testing (changing only its IP)

apache version 2.2.3

running in Centos5 VM

with 8Gb RAM allocated

Threads (via scoreboard) range from 50-100 active (with max 300)

vSphere ESXi 4.1 U1 (build 348481)

Hardware: Dell R610 with 2 x 6 Core Intel Xeon X5680 CPUS @ 3.3GHz

Starting with a vCPU setting of 1, I ran the following script to iterate from 25 to 175 concurrent requests in increments of 25 (the URL was an average page of 50K, and the 10000 requests took about 1 minute to serve in total):

# Script to increase concurrent requests

set x = 25
while ( $x < 200 )
    ab -n 10000 -c $x http://web-06/
    @ x += 25
end
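The same loop in plain /bin/sh, for non-csh shells (echoing the ab commands here rather than running them, since ab and the web-06 host are specific to the lab above):

```shell
x=25
while [ "$x" -lt 200 ]; do
  echo "ab -n 10000 -c $x http://web-06/"   # swap echo for the real ab run
  x=$((x + 25))
done
```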

The output for each vCPU# run was captured, then the VM was brought down to increase the vCPU setting to 2, 4, 6, 8 - capturing the results for each of the 5 vCPU levels. Bringing the results into excel tables the following metrics were compared across the vCPU runs (and I re-ran the vCPU runs later to confirm the data per vCPU config was not varying wildly from run to run) :

Total Connect Time:
This is the total time (Connect, Processing, Waiting) in milliseconds spent serving the request (we want this to be as low as possible). Observe the 1 vCPU configuration is markedly higher than the 2, 4, 6, and 8 vCPU configurations.

Deviation: The following apache bench metric accounts for how variable the total time to serve is - ideally we want this to be as small as possible so our apache performance is deterministically consistent. Observe how the 2 and 4 vCPU deviation is markedly higher than the 1 and 8 vCPU configurations (y-axis is milliseconds deviation from mean)

Requests per second: This metric measures the average requests per second served (x-axis is requests served per second). Below we can see the 2 and 4 vCPU configurations outperform the rest, but we have to remember this comes at the expense of increased variability. We are beginning to see the tradeoff: the 2 or 4 vCPU config will give most users slightly better response times, but the 8 vCPU config's behavior is more deterministic and gives a more consistently decent response time.

Transfer rate: This metric measures the Kb/sec delivered by the apache instance at each vCPU setting. It mirrors the requests/sec metric - the 2 and 4 vCPU configs deliver higher throughput on average, but at the expense of higher variability - giving some requests much lower throughput.

At our average peak thread load (call it 75 requests/second per apache instance), we see that while the 4 vCPU config delivers 9.4% higher throughput ON AVERAGE than the 8 vCPU config, the 8 vCPU config provides 45.3% better (lower) variability from that average. All other things being equal, the decision to keep the 8 vCPU configuration for our apache VM instances is therefore easily rationalized: most users get slightly lower throughput, but their worst response will be much better with 8 vCPUs than with 4.

Wednesday, February 16, 2011

Vfiler Non-disruptive Migration

We recently took delivery of upgrades for our aging NetApp 3040's (we ordered 3170's just before the 32xx came out - but that is another story!)
Using Data Motion for vFiler Migration (new in ONTAP 7.3.3 - see TR-3814), we were able to non-disruptively (with zero downtime) migrate all 25Tb of our storage services (15 vFilers) from the old hardware to the new hardware in less than one week!
What follows is the summary of our first use of this highly valuable new feature, including gotchas and bugs we encountered along the way.

As it happened, we were unaware of this new feature at first, so we planned to use our documented procedures involving the disruptive vfiler dr activate command line.
Since we run all our NFS exports (a mix of Oracle, VMware datastores, web content, video repositories, app and log file shares) out of vFilers with existing DR snapmirror relationships, the plan under ONTAP 7.3.3 was going to involve considerable disruption and many error prone steps:

1) Shutdown/pause NFS clients access to the vFilers
2) failover to the DR vFilers (dr activate)
3) resume clients on DR vFiler
4) upgrade the 3040 heads to 3170 - test
5) re-establish DR vFiler snapmirrors in reverse direction (3040->3170)
6) failback to new 3170 (dr activate)

We began testing to work out any issues with the steps. Immediately we ran into a new duplicate IP issue we had not seen with previous versions of ONTAP. We opened a case with NetApp but were basically told that the duplicate IP was expected behavior for failover on the same subnet.

This forced us to start looking at other options, and when we upgraded our Data Fabric Manager (DFM), also known as Operations Manager, and downloaded the new Network Management Console (NMC), also known as Provisioning Manager (PM) ;) - we noticed the new (for ONTAP 7.3.3) option to non-disruptively migrate vFilers from one cluster to another.

This tested fine, except when it brought down one head due to IO starvation, BUG 90134 ("Heavy file deletion loads can make a filer less responsive") - the head dropped off the net. This was resolved via another NetApp support case, where we re-prioritized volume deletion operations to avoid them swamping the head during the cleanup phase of the vFiler migration:

options wafl.trunc.throttle.system.max 30
options wafl.trunc.throttle.vol.max 30
options wafl.trunc.throttle.min.size 1530
options wafl.trunc.throttle.hipri.enable off
options wafl.trunc.throttle.slice.enable off

Once these settings were in place, we had no issues with vFiler migrations disrupting production.

We proceeded cautiously to migrate our first vfiler (the least critical ISO / file repository vFiler).
The clients did not notice or log any issues, so things were looking good for the non-disruptive aspect. However, we were less than excited that the vFiler migration had no way to import our existing multi-terabyte snapmirror relationships. To use PM's vFiler migration, we had to delete the snapmirror relationships with the DR vFiler and recreate them, re-initializing from scratch (sometimes taking 18 hours).

Another issue we had was that the PM vFiler migrate "wizard" did not allow us to specify the destination aggregate - however, a quick post to the community forums:

revealed the latest NMC 3.0D1 (released that week!) had added the option to select the destination aggregate. (Note: I found the responsiveness in the Storage Management Software forum to be amazing - on one occasion getting 3 solutions in one day!)

One other issue was vFiler migration failing due to unlicensed features (eg CIFS) on the destination.
We solved this with

vfiler disallow vfiler-vf-02 proto=cifs

We had successfully migrated VMware NFS datastore vFilers and Oracle-on-NFS vFilers without a single issue logged on the client side, but a couple of problematic vFilers kept erroring out for unspecified reasons, preventing us from getting off the old 3040 hardware. For these, we found running the same vFiler migrate from the command line proceeded without error.

vfiler migrate usage:

vfiler migrate [-l remote_login:remote_passwd] [-m method][-c secure]
[-e interfacename:ipaddr:netmask,...] remote_vfiler@remote_filer
vfiler migrate start [-l remote_login:remote_passwd] [-c secure]
[-e interfacename:ipaddr:netmask,...] remote_vfiler@remote_filer
vfiler migrate status remote_vfiler@remote_filer
vfiler migrate cancel [-c secure] remote_vfiler@remote_filer
vfiler migrate complete [-l remote_login:remote_passwd] [-c secure] remote_vfiler@remote_filer
Tracking Progress:
It was interesting to watch the migration monitor to get a glimpse into the internal operations on the backend making the migration non-disruptive:

In conclusion, Data Motion for vFiler Migration allowed us to upgrade from the old 3040 hardware to the new 3170 cluster with ZERO downtime. Once the initial issues were resolved, the migrations were robust: either they succeeded, or they failed without affecting production. In conjunction with VMware vMotion and Storage vMotion, NetApp's non-disruptive vFiler migrate provides increased operational agility and efficiency (eg we migrated a vFiler from one head to the other within the cluster to balance load). I recently took the new NetApp Operations Manager (OPSMGR) class to learn more about the whole context of Provisioning and Protection Manager and how it fits with the new ONTAP 8 direction - look for my review of the class in an upcoming post!

Thursday, February 10, 2011

Apache MaxMemFree For Lean Memory Control

We were experiencing unexplained memory spikes on our CentOS Apache VMs.
This was a big problem because within the span of 5-10 minutes our Apache webserver VMs would go from 23% memory usage to swapping out to the NFS datastore seriously impacting performance until we restarted apache to clear the condition.

This was a non-trivial (interesting!) problem to analyze for several reasons:
1) The spikes were not easily tied to any particular large increase in number of requests (in fact the scoreboard showed most threads were idle during these memory spikes)

2) determining the constituent apache memory components contributing to the memory usage spikes was not made easy:
2a) top reports memory usage based on Shared pages not the REAL memory actually bound to that httpd process exclusively
2b) we did not know which apache requests or which apache processes were consuming the memory (out of the dozens of httpd processes and thousands of requests)
3) The restart fix was easy enough so the root cause analysis was deferred for other priorities.

Road to the Solution (skip to bottom for the Solution ;):

First we needed a way to tie to the httpd PIDs to the requests they were serving.
Our existing LogFormat did not include the PID for the httpd serving the request
Adding %P to the end solved this:

LogFormat "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D %P" combined

Next, getting the REAL memory consumption for these bloated httpds was not happening with top. Turns out with Linux you need to get into the /proc/*/smaps files and analyze the Private_Dirty entries (Credit to Hongli Lai and his helpful post for giving the seed of this script)

[root@web08 02]# more ~/ShowMem.csh

foreach i (`ls /proc/*/smaps`)
echo `grep Private_Dirty $i | awk '{ print $2 }' | xargs ruby -e 'puts ARGV.inject { |i, j| i.to_i + j.to_i }'` " " `head -1 $i | awk '{print $NF}'` " " $i
end

Running this through sort shows the highest REAL (private dedicated to that PID) memory:
[root@web08 02]# ~/ShowMem.csh | sort -nr | head

16576 /usr/sbin/httpd /proc/23691/smaps
15484 /usr/sbin/httpd /proc/24871/smaps
3432 /usr/sbin/httpd /proc/24734/smaps
3188 /usr/sbin/httpd /proc/25354/smaps
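As an aside, the per-process Private_Dirty sum at the heart of the script can be done in awk alone, without the ruby/xargs hop - illustrated here against a two-line sample of smaps output (values are in kB):

```shell
printf 'Private_Dirty:       128 kB\nShared_Clean:         12 kB\nPrivate_Dirty:        64 kB\n' \
  | awk '/Private_Dirty/ { sum += $2 } END { print sum " kB" }'
```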
I then had the PID(s) of the most bloated httpds to search through the apache access logs.
I chose to focus on the LARGEST payload requests for these PIDs first.
Sort the access log by size of request:

awk '{print $(NF-1) " " $1 " " $2 " " $3 " " $4 " " $5 " " $7}' access_log.20110207 | sort -nr | more
Starting with a fresh restart of apache (so there were no bloated httpds yet), I then tested several of the high-payload URLs from the access log while watching the output of repeated ShowMem.csh runs to catch any httpds growing.
Surprisingly, the httpds did not grow when serving 300Mb+ mp4/mov files straight from the filesystem, but they did when the SAME FILE was served via mod_jk from the app layer!
(I quickly checked that mod_jk was up to date and that no newer version had memory fixes.)
I could not yet explain why this webapp/mod_jk combination caused apache to hold onto the memory in its smaps anon space, but now I could readily reproduce and observe the issue at will (and that is 99% of the battle).

Armed with this info, I started researching for apache memory directives and quickly found

MaxMemFree !!

After adding

MaxMemFree 10000

I repeated the test and did not see the desired effect advertised by the documentation.
Then I read in the forums that the units may be Mb instead of Kb as documented.
I then tried:

MaxMemFree 10

Repeating my test, I observed the httpd serving the 300Mb mp4 video file via mod_jk balloon to > 200Mb while serving the request, then quickly free this memory and return to 2.3Mb!
Success (MaxMemFree FTW!)
Our Apache instances are now running much leaner and effectively we've increased our capacity and eliminated our exposure to random requests bloating our httpd memory consumption.