Thursday, April 24, 2014

Performance Case

Just a performance case I worked recently. These kinds of things can be instructive for people with similar problems, or for those who learn by playing along at home. Here's my email; sorry for the copy-paste without context.

Please take a look at the email chain below as the starting point for this conversation. At this point, we've reviewed four sets of performance data gathered over the last two months and have closely correlated a spike in large-IOP-size traffic with our latency spikes. The spike is in both the number and the size of IOPS, exceeding 32,000 IOPS for 15+ minutes at a time. No single volume is driving the traffic; it appears to be increasing dramatically across the board.
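
(Aside for anyone playing along at home: the "32,000 IOPS for 15+ minutes" observation is easy to automate. Here's a minimal sketch, assuming you already have per-interval op counts from whatever collection you run; the sample format, threshold constant, and function name are mine, not a real tool.)

```python
# Minimal sketch: flag windows where the op rate stays above a threshold
# for a sustained stretch. The sample format, threshold, and window are
# stand-ins for whatever perf collection is actually in place.
from datetime import timedelta

THRESHOLD_OPS = 32_000            # the level called out above
SUSTAIN = timedelta(minutes=15)   # "15+ minutes at a time"

def sustained_breaches(samples):
    """samples: time-ordered list of (timestamp, total_ops) tuples.
    Returns (start, end) windows where ops stayed >= THRESHOLD_OPS
    for at least SUSTAIN."""
    breaches, start, end = [], None, None
    for ts, ops in samples:
        if ops >= THRESHOLD_OPS:
            start = start or ts
            end = ts
        else:
            if start and end - start >= SUSTAIN:
                breaches.append((start, end))
            start = None
    if start and end - start >= SUSTAIN:
        breaches.append((start, end))
    return breaches
```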

Here is a summary of the performance data from Thursday 4/3; note the IOP ramp-up and the associated latency:
Start Time | CPU Busy (%) | NFS Op/s | Read Op/s | Read Lat (ms) | Write Op/s | Write Lat (ms) | Net Sent (MB/s) | Net Recv (MB/s)
9:53p      | 52           | 7,127    | 1,256     | 2.32          | 5,731      | 0.63           | 27              | 50
9:57p      | 67           | 13,637   | 5,396     | 19.96         | 8,076      | 110.09         | 145             | 81
10:05p     | 99           | 31,119   | 17,272    | 22.91         | 13,697     | 341.18         | 371             | 205
10:13p     | 99           | 32,311   | 22,519    | 8.40          | 9,739      | 229.45         | 621             | 200
10:24p     | 99           | 23,183   | 12,819    | 7.21          | 10,261     | 143.16         | 348             | 260

And here is the data from Thursday 2/13:
Period | CPU Busy (%) | NFS Op/s | Read Op/s | Read Lat (ms) | Write Op/s | Write Lat (ms) | Net Sent (MB/s) | Net Recv (MB/s)
9:09p  | 78           | 12,574   | 5,562     | 3.19          | 6,926      | 1.05           | 200             | 141
9:18p  | 98           | 27,771   | 17,291    | 4.70          | 10,347     | 24.29          | 571             | 312
9:29p  | 98           | 33,050   | 21,460    | 9.11          | 11,507     | 125.85         | 650             | 352
9:38p  | 98           | 34,149   | 22,813    | 11.28         | 11,216     | 530.90         | 647             | 345
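
(Another aside: the correlation claimed above is easy to check directly from the tabulated samples. A rough sketch using the 4/3 numbers; nothing here is specific to the filer, just plain Python 3.10+ for statistics.correlation.)

```python
# Rough sketch: Pearson correlation between op rate and write latency,
# using the five 4/3 samples from the first table above.
from statistics import correlation   # Python 3.10+

nfs_ops   = [7_127, 13_637, 31_119, 32_311, 23_183]
write_lat = [0.63, 110.09, 341.18, 229.45, 143.16]   # ms

print(f"op rate vs. write latency: r = {correlation(nfs_ops, write_lat):.2f}")
# With these five points r comes out strongly positive (~0.9),
# consistent with the latency tracking the IOP ramp-up.
```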

One thing that stands out in the data is a large, sudden increase in 64k+ IOPS.  I’ve adjusted the table to include a row for 64k IOPS and have highlighted the relevant statistic.
FAS6280 Maximum IOPS (columns are read/write mix)

Avg IO Size | 100/0  | 75/25  | 50/50  | 25/75  | 0/100
64k         | 61,000 | 43,000 | 32,000 | 26,000 | 22,000
32k         | 68,000 | 48,000 | 36,500 | 30,000 | 25,000
24k         | 74,000 | 51,000 | 39,500 | 31,500 | 27,000
16k         | 80,000 | 56,500 | 43,000 | 36,500 | 30,500
8k          | 85,000 | 63,000 | 50,000 | 41,500 | 36,000
4k          | 90,000 | 66,000 | 54,000 | 45,000 | 40,000
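
(To turn that ceiling table into something actionable, here's a rough headroom check. The table values are the ones above; the bucketing and rounding policy is my own conservative choice, not an official sizing method.)

```python
# Rough headroom check against the FAS6280 ceilings above. The table
# values come straight from the post; the lookup policy is my own
# (conservative: round the IO size up and the read fraction down).
MAX_IOPS = {            # {avg_io_kb: {read_pct: max_iops}}
    64: {100: 61_000, 75: 43_000, 50: 32_000, 25: 26_000, 0: 22_000},
    32: {100: 68_000, 75: 48_000, 50: 36_500, 25: 30_000, 0: 25_000},
    24: {100: 74_000, 75: 51_000, 50: 39_500, 25: 31_500, 0: 27_000},
    16: {100: 80_000, 75: 56_500, 50: 43_000, 25: 36_500, 0: 30_500},
     8: {100: 85_000, 75: 63_000, 50: 50_000, 25: 41_500, 0: 36_000},
     4: {100: 90_000, 75: 66_000, 50: 54_000, 25: 45_000, 0: 40_000},
}

def headroom(observed_iops, avg_io_kb, read_pct):
    """Fraction of the (conservatively bucketed) ceiling still available."""
    io_bucket = min((s for s in MAX_IOPS if s >= avg_io_kb), default=64)
    mix_bucket = max((m for m in MAX_IOPS[io_bucket] if m <= read_pct), default=0)
    ceiling = MAX_IOPS[io_bucket][mix_bucket]
    return (ceiling - observed_iops) / ceiling

# Example: the 10:13p sample above (~32,300 op/s of mostly 64k, ~70% reads)
print(f"{headroom(32_311, 64, 70):.0%}")   # roughly -1% vs. the 64k 50/50 ceiling
```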


The workload mix appears fine for most of the day but experiences large-IOP-size peaks that are outside our guidelines and cause some pain (38,000 IOPS at 1pm on 4/5, 45,000 IOPS at 11pm on 4/4, 37,000 IOPS at 11am on 4/3). I'd also note that ~10% of the IO to this system is misaligned, which keeps us from getting full performance ROI out of the hardware. Lastly, this system is achieving 65-85% dedupe ratios, which is fantastic space conservation but adds to the overall workload.
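
(Quick aside on what "misaligned" means here: WAFL works in 4k blocks, so a guest partition that starts at an offset that isn't a multiple of 4096 makes every guest IO straddle two filer blocks. A tiny illustration; the offsets are just examples.)

```python
# Tiny illustration of IO misalignment. WAFL works in 4k blocks; a guest
# partition starting at a non-4096-multiple offset makes every guest
# block straddle two filer blocks. The example offsets are made up.
WAFL_BLOCK = 4096

def is_aligned(partition_offset_bytes):
    return partition_offset_bytes % WAFL_BLOCK == 0

def filer_blocks_touched(io_offset, io_size, block=WAFL_BLOCK):
    """How many 4k filer blocks a single IO actually touches."""
    first = io_offset // block
    last = (io_offset + io_size - 1) // block
    return last - first + 1

print(is_aligned(32_256))                      # False: classic 63-sector MBR offset
print(filer_blocks_touched(32_256, 4_096))     # 2 blocks for a 4k IO -> misaligned
print(filer_blocks_touched(1_048_576, 4_096))  # 1 block -> aligned
```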

As discussed yesterday, here are our options:

Short term steps:
- Stagger workloads (Symantec, et al.)
- Disable aggr snapshots (done)
- Stagger dedupe (see the scheduling sketch below)
- Case open on daytime snapshot-correlated latency (done)
- Update Data ONTAP

Long term solutions:
- Add new disk to the passive controller and rebalance the workload, or
- Shift the workload to a different or new HA pair
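
(On the two "stagger" items: the point is just to keep heavy background jobs, backup groups and dedupe scans, from all kicking off in the same window. A trivial sketch of the scheduling idea; the volume names and the off-hours window are made up, and nothing here is ONTAP-specific.)

```python
# Trivial sketch of the staggering idea: spread N background jobs
# (backup groups, dedupe scans on individual volumes) across an
# off-hours window instead of starting them all at the same time.
# Volume names and the window are hypothetical.
def stagger(jobs, window_start_hour=22, window_hours=6):
    """Assign each job an evenly spaced start hour within the window."""
    step = window_hours / max(len(jobs), 1)
    return {job: (window_start_hour + i * step) % 24
            for i, job in enumerate(jobs)}

for vol, hour in stagger(["vol_vmware1", "vol_vmware2", "vol_sql", "vol_exch"]).items():
    print(f"{vol}: start dedupe around {int(hour):02d}:{int(hour % 1 * 60):02d}")
```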
