Tuesday, February 8, 2011

Daily VMware Slowness

I was in a meeting Friday when someone mentioned that our VMs (of which we have hundreds spread across 4 datacenters and 12 ESX servers) were experiencing degraded performance every day around lunchtime. I set up PerfMon tasks to watch the CPU utilization and the I/O reads and writes of the various processes on two different VMs. What I found was: nothing. There was no data to support the slowness. We then started experiencing slowness at 5pm instead, so I analyzed that data and found this:

Virtual Machine 1, 4:50pm to 6pm: 

Virtual Machine 2, 4:50pm to 6pm:


If I were a detective, I would probably call that a clue. Turns out our engineer who owns McAfee for us had recently made some changes, which I haven't been able to dig into, that made all of our VMs update their local virus definitions simultaneously. When people started suggesting that this might be causing the performance issues (a theory that was disparaged by some), he moved the update to 5pm, and the slowness moved with it. Mystery solved. We're going to work to stagger the DAT updates, so that our VMware CPU and LUN I/O resources aren't all pegged at the same time.
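To make the staggering idea concrete, here is a minimal Python sketch. It is not a McAfee feature and not what we actually deployed; the function names, host names, and the two-hour window are all made up for illustration. The idea is simply to derive a stable per-VM offset from the host name, so each machine lands on its own slot inside an update window instead of everyone firing at once.

```python
import hashlib
from datetime import datetime, timedelta


def stagger_offset_minutes(hostname, window_minutes):
    """Return a stable offset in [0, window_minutes) derived from the host name.

    A cryptographic hash is used only because it is stable across runs and
    machines (Python's built-in hash() is randomized per process).
    """
    digest = hashlib.sha256(hostname.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_minutes


def update_time(hostname, window_start, window_minutes):
    """Return the time at which this (hypothetical) host should pull its update."""
    offset = stagger_offset_minutes(hostname, window_minutes)
    return window_start + timedelta(minutes=offset)


# Hypothetical host names and a hypothetical 2-hour window starting at 5pm.
window_start = datetime(2011, 2, 8, 17, 0)
for host in ["vm-app-01", "vm-app-02", "vm-db-01"]:
    print(host, update_time(host, window_start, 120).strftime("%H:%M"))
```

With hundreds of VMs and a window of only 120 one-minute slots, several machines will still share a slot, but that is a long way from all of them hitting the LUNs in the same minute.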

A couple of notes on PerfMon:
- Kick logs out to CSV or TSV. Either can be imported into Excel, which is much easier and more versatile to work with than PerfMon's own data analyzer.
- Make sure to select "All instances" so that each process gets its own set of data points.
- I took readings every 10 seconds and saw no real performance impact over the course of the day. 
- I highly recommend deleting the "Idle" and "_Total" process instances first thing when analyzing the data you've collected, or better yet not recording data from them at all: they're just noise.
- I recommend setting a 1-4 hour time frame for the data collection so the data is broken out into multiple files: nothing is more of a hassle than trying to work with a 200MB text file. Trust me, I learned that lesson a year ago (thanks HP EVAStats!).
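If you'd rather script the cleanup than do it by hand in Excel, something like the following Python sketch would do it. It assumes the usual PerfMon CSV layout (a timestamp column followed by one column per counter, with headers like `\\HOST\Process(name)\% Processor Time`); the host name, process name, and sample values below are invented purely to show the shape of the data.

```python
import csv
import io
import re

# Instances treated as noise. "_Total" is the name PerfMon normally gives
# the aggregate instance of the Process object.
NOISE_INSTANCES = {"idle", "_total"}

# Pulls the instance name out of a header like \\HOST\Process(name)\Counter
INSTANCE_RE = re.compile(r"\\Process\(([^)]*)\)\\", re.IGNORECASE)


def instance_name(header):
    """Return the process instance named in a counter header, or None."""
    match = INSTANCE_RE.search(header)
    return match.group(1) if match else None


def drop_noise_columns(rows):
    """Return the rows with the Idle and _Total columns removed."""
    keep = [i for i, name in enumerate(rows[0])
            if (instance_name(name) or "").lower() not in NOISE_INSTANCES]
    return [[row[i] for i in keep if i < len(row)] for row in rows]


def peak_per_column(rows):
    """Map each counter header to its largest numeric sample."""
    header, data = rows[0], rows[1:]
    peaks = {}
    for i, name in enumerate(header[1:], start=1):
        values = []
        for row in data:
            try:
                values.append(float(row[i]))
            except (ValueError, IndexError):
                continue  # blank samples show up as " " in PerfMon CSVs
        if values:
            peaks[name] = max(values)
    return peaks


# Made-up sample data, not taken from the real logs.
SAMPLE = r'''"(PDH-CSV 4.0) (Eastern Standard Time)(300)","\\VM1\Process(Idle)\% Processor Time","\\VM1\Process(_Total)\% Processor Time","\\VM1\Process(scanner)\% Processor Time"
"02/04/2011 16:50:00.000","95.0","100.0"," "
"02/04/2011 16:50:10.000","2.0","100.0","97.5"
'''

rows = list(csv.reader(io.StringIO(SAMPLE)))
cleaned = drop_noise_columns(rows)
for counter, peak in peak_per_column(cleaned).items():
    print(counter, peak)
```

Pointing `csv.reader` at a real log file instead of `SAMPLE` should work the same way, and with 1-4 hour files each one stays small enough to process in memory.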

