Sunday, April 24, 2011

shrinking snapmirror destination volumes

When the initial snapmirror relationship is set up, we typically have source and destination volumes of equal size. But when the source volume is shrunk:

vol size appdata -200g

the destination remains at its original size. The problem is that we now have unclaimed space that is not reported by the usual tools (df, System Manager, etc.) - the tools report what the source volume reports.

Over time you can end up with snapmirror destination volumes that are too large, and the space allocated during the initial snapmirror is effectively unavailable for new snapmirror destination volumes (or any other use) until you free it by shrinking the snapmirror destination volume (the example volume is "vol6" below):

0) df -A (check initial aggregate usage level)
aggr1 14036753204 12158892092 1877861112 87%

1) snapmirror break vol6
2) vol options vol6 fs_size_fixed off
3) df -A (verify the space is returned to the aggr for reuse)
aggr1 14036753204 11812443012 2224310192 84%

But how do you get ONTAP to report the unused space without going through the procedure above?

One method is

vol status -b

this will report the volume size and filesystem size in 4K blocks - I added some awk to show the difference in MB:

[root@backup1 ~]# ssh netapp-01 -l root vol status -b | awk '{print $0" diff = "($3-$4)*4096/1024/1024" MB"}'
Volume Block Size (bytes) Vol Size (blocks) FS Size (blocks)
------ ------------------ ------------------ ----------------
vol0 4096 25952256 25952256
vm6 4096 684510413 590558004 diff = 367002 MB
sg1 4096 8912896 8650752 diff = 1024 MB
web2 4096 14417920 13107200 diff = 5120 MB
backup1 4096 5242880 3932160 diff = 5120 MB
data1 4096 15204352 14417920 diff = 3072 MB
vm2 4096 667942912 563714458 diff = 407142 MB
archive1 4096 418906112 402653184 diff = 63488 MB
ora6 4096 275251200 251658240 diff = 92160 MB
fcapdata 4096 18350080 14417920 diff = 15360 MB
apdata 4096 367001600 348966093 diff = 70451.2 MB
ora4 4096 216006656 183500800 diff = 126976 MB
backup2 4096 443862221 322122548 diff = 475546 MB
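
If you want the slack reported directly in GiB and limited to volumes that actually have a difference, a variant along these lines should work (untested sketch of the same one-liner, skipping the two header lines):

[root@backup1 ~]# ssh netapp-01 -l root vol status -b | awk 'NR>2 && $3>$4 {printf "%s diff = %.1f GiB\n", $1, ($3-$4)*4096/1024/1024/1024}'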


Now this takes the guesswork out of where the space is overallocated, and I can use the diff numbers to shrink the snapmirror volumes.
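
Putting it together, the per-volume sequence is roughly the following sketch (the volume name "vm6" and the 350g delta are placeholders - use the diff reported above for the volume in question):

snapmirror break vm6                  # make the destination writable
vol options vm6 fs_size_fixed off     # unpin the filesystem size (this alone returned space in the test above)
vol size vm6 -350g                    # if needed, shrink explicitly by roughly the reported diff
snapmirror resync vm6                 # re-establish the mirror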

Update 3/16/15

snapmirror break vm6
vol options vm6 fs_size_fixed off # This returns the space to the aggregate
snapmirror resync vm65net


Wednesday, April 20, 2011

CPU resource shares bug

Folks in the VMware forums have long noticed this warning when attempting to move VMs into resource pools:

"The CPU resource shares for (vm you are trying to move) are much lower than the virtual machine's in the resource pool. With its current share setting, the virtual machine will be entitled to 2% of the CPU resources in the pool. Are you sure you want to do this?"


The workaround I found is to merely view (without changing anything) the Resources settings for the VM: Right-click the VM -> Edit Settings... -> Resources tab -> OK

vCenter reports a "Reconfigure Virtual Machine" task (but we did not change anything!)

However, there are other variations of the message that this workaround does not resolve (e.g. "much higher", or memory instead of CPU resources) - those cases will be worth opening a support case for.

Friday, April 15, 2011

vMotion Unicast Flood ESXi

Our VMware management and vMotion NICs share the same IP space.
This is not VMware best practice - they recommend vmkernel/vMotion traffic be isolated in its own IP space.

Problem:

While running ESX 4.1 we had no issues associated with the shared IP space, but once we started migrating to ESXi we noticed more and more network disruption during vMotion (especially bulk vMotions when migrating 10-20 VMs for ESXi maintenance). Switches reported high collisions and drops, and the Juniper firewall load would spike. Forum threads revealed others experiencing the same issue.

Solution:

The issue is resolved by adding a new vMotion vmkernel NIC in a private IP space.
This is a best-practice recommendation I previously believed would require a network topology design change with downtime.
But since the vMotion traffic does not route outside the cluster, VMware support was able to demonstrate it is as simple as adding a new vmkernel NIC (vmk3 below) dedicated to vMotion:

~ # esxcfg-route -l
VMkernel Routes:
Network          Netmask          Gateway        Interface
169.4.5.0        255.255.255.224  Local Subnet   vmk2
17.5.5.0         255.255.255.0    Local Subnet   vmk0
default          0.0.0.0          17.5.5.1       vmk0
Add a new vmkernel port (in VI client: Select your ESXi host->Configuration Tab->Networking->vSwitch Properties->Add->VMkernel):



Choose a private IP space - it does not even need to match the existing default gateway setting, since according to VMware support the vMotion traffic is not routed anyway. (I chose the convention of modifying the first octet of our existing IP to make it 10.x.y.z and updating our networking records accordingly.)
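
For reference, the same change can be made from the ESXi shell; a rough sketch, where the port group name "vMotion2", the vSwitch, and the address are placeholders for our setup:

esxcfg-vswitch -A vMotion2 vSwitch0                        # create a port group for the new vMotion vmk
esxcfg-vmknic -a -i 10.5.5.12 -n 255.255.255.0 vMotion2    # add the vmkernel NIC (shows up as vmk3 here)
vim-cmd hostsvc/vmotion/vnic_set vmk3                      # designate vmk3 as the vMotion interface
esxcfg-vmknic -l                                           # verify the new vmk and its IP

Afterwards the route table picks up the new local subnet: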
~ # esxcfg-route -l
VMkernel Routes:
Network Netmask Gateway Interface
169.4.5.0 255.255.255.224 Local Subnet vmk2
10.5.5.0 255.255.255.0 Local Subnet vmk3 <----new vMotion private net
17.5.5.0 255.255.255.0 Local Subnet vmk0
default 0.0.0.0 17.5.5.1 vmk0
~ #

Note: when you use the VI client to add a vMotion port, the previous vMotion port has its vMotion bit DISABLED (since only one vMotion port is allowed)

I made the change on 2 ESXi hosts, verified that bulk vMotions no longer caused the network disruptions, and then rolled out the change to the rest of our cluster nodes (including the ESX hosts) for consistency.

Conclusion:
This was a satisfying resolution all round: our configuration was brought in line with best practices with zero downtime, and the network disruption during vMotion was addressed at its root cause.

Wednesday, April 6, 2011

FusionIO ESXi PVSCSI VM Benchmarking

FusionIO recently (3/27/11) released their ESXi 4.1 drivers:

http://support.fusionio.com/driver/cross_vmware-esxi-drivers-block-iomemory-vsl_410.2.3.0.281.offline-bundle.zip

So, I took the opportunity to put the 600GB ioDrive Duo through some VM benchmarks.
NOTE: If you get VIB Signature errors installing the driver like I did - see:
http://www.vmadmin.info/2010/09/vib-signature-missing.html
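
If it helps, the offline bundle can be installed remotely with the vSphere CLI; a rough sketch (host in maintenance mode, server name and credentials are placeholders):

vihostupdate.pl --server esxi01 --username root --install --bundle cross_vmware-esxi-drivers-block-iomemory-vsl_410.2.3.0.281.offline-bundle.zip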

With the release of these drivers, there was finally native support for FusionIO datastores for ESXi.
(Previously folks were doing things like running Starwind to export the FusionIO over iSCSI)

Lab Config:
Dell 1950, dual quad-core Intel E5440 2.83GHz with 16GB RAM
ESXi 4.1 U1

Benchmark Results:

HD Tune:

Default LSI SAS VM SCSI controller: 1222 MB/sec



PVSCSI VM SCSI driver: 1368 MB/sec



IOMeter:

Default LSI SAS VM SCSI Driver: 20721 IOPS @ 79% CPU



PVSCSI Driver: 21836 IOPS @ 33% CPU



Conclusion:

These FusionIO throughput and IOPS numbers are around 4 times better than the NetApp 3040 40-disk aggregate numbers obtained in previous benchmarks.

(1368-1222)/1222 = 11.9% better throughput with PVSCSI
IOMeter shows: 79% - 33% = 46 percentage points less CPU with PVSCSI