Thursday, March 4, 2010

EVAPerf Statistical Limitations

Here's a big issue I wrestled with over the last few weeks:

EVAPerf occasionally hiccups, kicking out a single data point that claims a 2Gb host port has 30Gb/s of throughput, or that a single disk group has 200GB/s being written to it. Sometimes these are clearly overflow numbers (214748.3647 showed up repeatedly) and sometimes they are just absurdly high, though unique and precise. HP recommends using the 95th percentile to statistically analyze the performance of your EVAs, and these super-high numbers skew our statistics to the point of being worthless.
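
An aside: that recurring 214748.3647 looks a lot like a scaled 32-bit integer overflow, i.e. the signed 32-bit maximum with four implied decimal places. A quick check (my arithmetic, not anything out of HP's documentation):

```python
# 2^31 - 1, divided down by the four implied decimal places EVAPerf appears to use
print((2**31 - 1) / 10000.0)   # -> 214748.3647
```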

My solution: create duplicate, empty SQL tables and screen the data daily, moving any data points over thresholds I set into those duplicate tables, where they'd be out of the scope of my automated reporting. The trouble is, where's the threshold?
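
Roughly, the daily screen looks like this. This is a minimal sketch using sqlite3 just so it runs standalone; the real thing points at the SQL database the EVAPerf data gets loaded into, and the table names, column names, and sample rows here are made up:

```python
import sqlite3

# 50 GB/s expressed in MB/s (1 GB = 1024 MB); see the thresholds I landed on below.
DISK_GROUP_THRESHOLD_MBPS = 50 * 1024

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source table plus an identical "quarantine" copy that the automated reports never query.
cur.execute("""CREATE TABLE eva_diskgroup_perf (
                   sample_time TEXT, disk_group TEXT, write_mbps REAL)""")
cur.execute("""CREATE TABLE eva_diskgroup_perf_outliers (
                   sample_time TEXT, disk_group TEXT, write_mbps REAL)""")

# One sane sample and one absurd one, for demonstration.
cur.executemany("INSERT INTO eva_diskgroup_perf VALUES (?, ?, ?)",
                [("2010-03-04 08:00", "DG1", 900.0),
                 ("2010-03-04 08:05", "DG1", 214748364.7)])

# The daily screen: copy anything over the threshold into the quarantine table,
# then delete it from the table the reports read.
cur.execute("""INSERT INTO eva_diskgroup_perf_outliers
               SELECT * FROM eva_diskgroup_perf WHERE write_mbps > ?""",
            (DISK_GROUP_THRESHOLD_MBPS,))
cur.execute("DELETE FROM eva_diskgroup_perf WHERE write_mbps > ?",
            (DISK_GROUP_THRESHOLD_MBPS,))
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM eva_diskgroup_perf").fetchone()[0],
      "rows kept for reporting")
print(cur.execute("SELECT COUNT(*) FROM eva_diskgroup_perf_outliers").fetchone()[0],
      "rows quarantined")
```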

After a few weeks of emailing back and forth with HP, it became clear that they weren't interested in giving a definite answer or OK'ing my calculations. Part of the reason is that they have no real insight into how much actual IO CAN occur in a disk group, because of all the behind-the-scenes leveling, parity calculation, and striping that happens on top of the server-generated IO.

On top of that, block size varies so widely that the IO capacity calculations they ARE able to do give you no real sense of the throughput capacity of your hardware. For example, say their PerfMonkey tool told you your disk configuration allowed for 6000 read and 4000 write IO/sec in a disk group. Theoretically, with block sizes of up to 64MB apiece, your throughput could be over 380GB/s. So we're without a solid mathematical recourse.
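
Spelled out, using the same hypothetical PerfMonkey figures from the example above:

```python
# Worked version of the example above: hypothetical PerfMonkey IO/sec figures
# multiplied by a worst-case 64MB transfer size.
read_iops, write_iops = 6000, 4000
block_mb = 64

read_only_gbps = read_iops * block_mb / 1000.0                 # 384 GB/s from reads alone
combined_gbps = (read_iops + write_iops) * block_mb / 1000.0   # 640 GB/s if writes ran the same size

print("reads only: %.0f GB/s, reads plus writes: %.0f GB/s" % (read_only_gbps, combined_gbps))
```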

I settled on 50GB/s for disk groups and LUNs, and 10Gb/s for host ports (even though they're only 4Gb ports) after careful analysis of how that affected the data; with those thresholds, it works out to about 40 data points per month being moved.

Working pretty nicely so far.