Thursday, March 4, 2010

EVAPerf Statistical Limitations

Here's a big issue I wrestled with over the last few weeks:

EVAPerf occasionally hiccups, kicking out a single data point that claims a 2Gb host port has 30Gb/s of throughput, or that a single disk group has 200GB/s being written to it. Sometimes these are clear overflow numbers (214748.3647, which is 2^31 - 1 with the decimal point shifted, showed up repeatedly) and sometimes they are just absurdly high, though unique and precise. HP recommends using the 95th percentile to statistically analyze the performance of your EVAs, and these super-high numbers skew our statistics to the point of being worthless.

My solution: create duplicate, empty SQL tables and screen the data daily, moving any data points above thresholds I set into those duplicate tables, where they'd be outside the scope of my automated reporting. The trouble is, where do you set the threshold?

After a few weeks of emailing back and forth with HP, it became clear that they weren't interested in giving a definite answer or OK'ing my calculations. Part of the reason is that they have no real insight into how much actual IO CAN occur in a disk group, because of all the behind-the-scenes leveling, parity calculation, and striping that happens on top of the server-generated IO.

On top of that, block size varies so widely that the IO capacity calculations they ARE able to do give you no real concept of the throughput capacity of your hardware. For example, let's say their PerfMonkey tool said your disk configuration allowed for 6000 read and 4000 write IO/sec in your disk group. Theoretically, with block sizes of up to 64MB apiece, your read throughput alone could be over 380GB/s (6000 x 64MB = 384GB/s). So we're without a solid mathematical recourse.

I settled on 50GB/s for disk groups and LUNs, and 10Gb/s for host ports (even though they're only 4Gb ports) after careful analysis of how those cutoffs affected the data - it looks like about 40 data points per month get moved at those thresholds.
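
For illustration, here's a minimal sketch of what the daily screening pass looks like, written as T-SQL since the data lands in SQL Server. The table and column names (vdisk_stats, vdisk_stats_outliers, read_mb_per_s, write_mb_per_s) are hypothetical stand-ins, not our actual schema:

    -- Move any sample claiming more than 50GB/s (50 x 1024 = 51200 MB/s)
    -- into the normally-empty duplicate table, out of reporting scope.
    -- DELETE ... OUTPUT does the move in one atomic statement.
    DELETE FROM vdisk_stats
    OUTPUT DELETED.* INTO vdisk_stats_outliers
    WHERE read_mb_per_s > 51200
       OR write_mb_per_s > 51200;

The same statement, pointed at the host port table with a 10Gb/s cutoff, handles the ports. The nice thing about the duplicate-table approach is that nothing gets thrown away - a suspect data point can always be pulled back if it turns out to be real.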

Working pretty nicely so far.

HP EVA fnames.conf

Our SAN environment has 120TB spread over four HP EVAs (three 8100s and a 4400). We've worked through numerous difficulties with these, not least of which was the dreaded "Saturday morning slowness." Part of our effort to combat this was an attempt to gain greater insight into where the IO was actually coming from - at the time, we simply didn't have a system in place to do this.

Our friends at HP provide EVAPerf, which kicks out CSV files with a deluge of data (200MB+ per day per array). To make sense of this, a good friend on the software engineering side was brought in as a resource: he did a great job of writing a loading program that takes those CSV files and kicks them into SQL. Our company is looking into IP rights; it's that awesome.
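
The loader itself is his work (hence the IP question), but for a rough idea of the general approach: a single EVAPerf CSV can be pulled into a staging table with plain T-SQL. The table name and file path here are hypothetical:

    -- Hypothetical staging table and path; a real loader also has to sort
    -- out which EVAPerf object type (vdisk, host port, etc.) each file holds.
    BULK INSERT evaperf_staging
    FROM 'C:\evaperf\csv\vdisk_20100304.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n',
        FIRSTROW = 2  -- skip the CSV header row
    );

From staging, the rows can then be typed and distributed to per-object tables for the reporting queries to run against.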

Meanwhile, we discovered a few months ago that our CSVs were filled with WWNs, which are pretty cumbersome to work with. HP's solution is the friendly names file, fnames.conf, which allows the EVAPerf task on your server to replace the WWNs with readable, English names. So fnames.conf was set up to recreate itself automatically once a day.

Well, as time went on and we changed more and more, the data I was working with was increasingly filled with WWNs. Investigation Monday morning turned up this little gem:
"The fnames.conf file must reside in the directory in which HP Command View EVAPerf was installed."

Well, whoever set this up had it creating the updated file in a c:\utilities subdirectory, so we were working with a months-old fnames file, since EVAPerf was installed under c:\programs\hp... After fixing this, we went from 15/150 correctly represented LUNs in one EVA's CSV file to 135/150 LUNs showing up with real names, greatly simplifying our vdisk I/O statistics. Woot!

Extremely useful information on decoding EVAPerf data:
http://www.fcoe.ru/index.php?option=com_content&task=view&id=257&Itemid=46#addcomments

See here for more official info on the fnames/EVAPerf integration:
http://h10032.www1.hp.com/ctg/Manual/c00605846.pdf

Resurrection!

I had hoped I'd be able to continue to post solutions to tough problems and general summaries of things I'd learned, but only got 5 posts in before other priorities took hold. A lighter school load this quarter will hopefully give me a better chance to keep this updated!