Thursday, April 26, 2012

Meta Storage vMotion: Netapp Datamotion!

Our recent trials of SSD and hybrid storage from startup vendors have given me a renewed appreciation for the solid, staid, well QA'ed feature sets of the tier 1 storage vendors, in this case Netapp's DataMotion.

Datamotion is to the datastore as vMotion is to the VM.
Datamotion is "meta storage vMotion" and the benefits are analogous.

Our vSphere VMs run in vFiler NFS datastores on our Netapp (Ontap 8.1) production cluster.  We replicate our production cluster to our campus standby cluster.  Just as vMotion allows for non-disruptive, zero downtime hypervisor upgrades, datamotion allows us to shift vFilers hosting dozens of VMs live, with zero downtime, between physical clusters separated by campus/metro area distance.

We've employed datamotion for the last 3-4 cluster upgrades - "datamotion'ing" off all vFilers from the production cluster to the standby cluster, upgrading the evacuated cluster (Ontap versions, plus hardware, trays, flash cache SSD) - all with zero downtime.

In this age of storage startups - each with their own brand new filesystem, which may or may not protect your data integrity - I have to commend the engineering acumen behind datamotion.  I have seen datamotions fail (< 5% of the time), but as with failed vMotions, the failure happens before the cutover, so the fail-safes hold and the process is robust: no downtime. Correct the underlying issue and the next datamotion succeeds.

We recently installed Netapp SSD flash cache - in a follow-up post I will use (unsupported online) datamotion between cluster heads to compare performance with and without the SSD flash cache.

Sunday, April 15, 2012

vExpert 2012

It's been a great privilege being associated with the vExpert 2011 class. Whether meeting up at VMWorld, PEX, VMUGs or leveraging the great knowledge base this diverse group represents, the experience has been richly rewarding on many levels!

So I am very pleased to be part of the just-announced 2012 vExpert group.
Congratulations to those returning and new vExperts - if @sherrod's SVVMUG keynote last week was any indication, we have some very exciting times ahead!

Thursday, April 5, 2012

ESXi Reboot Causes ARP flood

We recently encountered a very strange issue where rebooting one of our ESXi 5.0 hosts would cause network disruption (NTI Enviromux devices would drop off the net, older switches would experience packet loss and vLAN reconfig flapping).

Eventually we determined with a packet capture that the host was flooding the subnet with ARP requests related to one of its NFS datastores.  When we removed this datastore, the host could reboot without causing the network disruption.  But we needed this datastore to satisfy the HA requirement for 2 heartbeat datastores.

The datastores are all provisioned from Netapp with the exports file listing each host's IP address for RW and ROOT access.  Upon closer inspection, the problem host's IP was listed only in the RW section of the Netapp exports file.  So, while the datastore would show up as mounted, ESXi was not given the full access it needed for HA heartbeat functionality, and the result was this flood of ARP requests.
Once the exports file was updated with the host's IP in the ROOT section, the datastore remounted with full permissions and the host stopped exhibiting the ARP flood and the network disruption.
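
A check like the one we did by eye can be scripted. Below is a minimal Python sketch that flags any host listed under rw= but missing from root= in an exports file. The volume names and IP addresses are hypothetical, and the line format (path followed by comma-separated options, hosts separated by colons) is assumed to match a 7-mode style /etc/exports; adjust the parsing if your file differs.

```python
def parse_export_line(line):
    """Split one export line into (path, {option: [hosts]})."""
    parts = line.split(None, 1)
    path = parts[0]
    opts = parts[1] if len(parts) > 1 else ""
    options = {}
    for opt in opts.strip().lstrip("-").split(","):
        key, sep, value = opt.partition("=")
        if sep:
            # Hosts are assumed to be colon-separated IPv4 addresses or names;
            # this simple split does not handle IPv6 literals.
            options[key.strip()] = [h for h in value.split(":") if h]
    return path, options


def hosts_missing_root(exports_text):
    """Return {path: [hosts]} for hosts that have rw access but no root access."""
    missing = {}
    for line in exports_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        path, options = parse_export_line(line)
        root_hosts = set(options.get("root", []))
        gap = [h for h in options.get("rw", []) if h not in root_hosts]
        if gap:
            missing[path] = gap
    return missing


# Hypothetical sample: the second export omits one host from root=.
SAMPLE = """\
/vol/ds_heartbeat1 -sec=sys,rw=10.0.0.11:10.0.0.12,root=10.0.0.11:10.0.0.12
/vol/ds_heartbeat2 -sec=sys,rw=10.0.0.11:10.0.0.12,root=10.0.0.11
"""

if __name__ == "__main__":
    for path, hosts in hosts_missing_root(SAMPLE).items():
        print(path, "has rw but not root for:", ", ".join(hosts))
```

Running this against a copy of the exports file before rebooting a host would have pointed straight at the datastore that caused our ARP flood.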

We still have a case open with VMware to determine why ESXi 5.0 Update 1 behaves this way (allowing an NFS mount without the access required for full functionality, plus unleashing an ARP flood).

Tuesday, April 3, 2012

vCenter SQL Max Server Memory

We encountered an error on our vCenter 5 server today:

SQL Server failed with error code 0xc0000000 to spawn a thread to process a new login or connection. Check the SQL Server error log and the Windows event logs for information about possible related problems.

Our vCenter VM is Windows 2008 (64 bit) with MS SQL 2008.  The VM is allocated 16 GB RAM - but I've noticed for a while that no matter how much memory is allocated, the SQL Server process will eventually grow to fill it.  This error got me digging for a solution.

As it turns out, in SQL 2008 the memory allocation for SQL is dynamic, and this growth of the SQL Server memory use is expected UNLESS you set the "Maximum Server Memory".

This is done simply via the SSMS (SQL Server Management Studio):

Right-click the server instance (in Object Explorer) -> Properties -> Memory
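
The same setting can be scripted instead of clicked. Here is a small Python sketch that builds the T-SQL for it using sp_configure and the 'max server memory (MB)' option. The function name and the 6 GB reserve for Windows and the other vCenter services are my own illustrative assumptions, not a recommendation; pick a reserve that fits your server.

```python
def max_server_memory_tsql(total_ram_mb, reserve_mb):
    """Build T-SQL that caps SQL Server memory at total RAM minus a reserve.

    reserve_mb is memory left for the OS and other services (hypothetical
    value chosen by the caller).
    """
    if reserve_mb <= 0 or reserve_mb >= total_ram_mb:
        raise ValueError("reserve_mb must be between 0 and total_ram_mb")
    cap_mb = total_ram_mb - reserve_mb
    return "\n".join([
        # 'max server memory (MB)' is an advanced option, so expose it first.
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'max server memory (MB)', {cap_mb};",
        "RECONFIGURE;",
    ])


if __name__ == "__main__":
    # 16 GB VM with an assumed 6 GB reserve gives a 10240 MB cap.
    print(max_server_memory_tsql(16 * 1024, 6 * 1024))
```

Paste the printed statements into a query window in SSMS, or run them with your usual SQL client, to get the same result as the Properties dialog.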

The change is effective immediately - in taskmgr I saw our SQL Server memory usage DROP from 13 GB to 10 GB.  Limiting the SQL memory should also leave more room for other vCenter processes and jobs (like the vCheck5 report - which has been taking > 3 hours lately).  Tune in tomorrow to find out!