Just a performance case I worked recently. These kinds of things can be instructive for people with similar problems, or for those who learn by playing along at home. Here's my email; sorry for the copy-paste without context.
Please take a look at the email chain below as the starting point for this conversation. At this point, we've reviewed four sets of performance data gathered over the last two months and have closely correlated a spike in large-IOP-size traffic with our latency spikes. The spike is in both the number and the size of IOPS, exceeding 32,000 IOPS for 15+ minutes at a time. No single volume is driving the traffic; it appears to be increasing dramatically across the board.
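For anyone playing along at home, here's roughly the kind of check behind that "no single volume" claim, assuming you've pulled per-volume op counts into a CSV somewhere; the file name and column names below are made up for illustration, not an actual export format.

```python
# Rough check that no single volume dominates the traffic.
# Assumes a hypothetical CSV export of per-volume counters with
# columns "volume" and "ops"; adjust to however you collect the data.
import csv
from collections import defaultdict

totals = defaultdict(int)
with open("per_volume_ops.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["volume"]] += int(row["ops"])

grand_total = sum(totals.values())
for vol, ops in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{vol:20s} {ops:>12,d} ops  {100.0 * ops / grand_total:5.1f}% of total")
```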
Here is a summary of the performance data from Thursday 4/3; please note the IOP ramp-up and the associated latency:
| Start Time | CPU Busy (%) | NFS Op/s | Read Op/s | Read Lat (ms) | Write Op/s | Write Lat (ms) | Net Sent (MB/s) | Net Recv (MB/s) |
|---|---|---|---|---|---|---|---|---|
| 9:53p | 52 | 7,127 | 1,256 | 2.32 | 5,731 | 0.63 | 27 | 50 |
| 9:57p | 67 | 13,637 | 5,396 | 19.96 | 8,076 | 110.09 | 145 | 81 |
| 10:05p | 99 | 31,119 | 17,272 | 22.91 | 13,697 | 341.18 | 371 | 205 |
| 10:13p | 99 | 32,311 | 22,519 | 8.4 | 9,739 | 229.45 | 621 | 200 |
| 10:24p | 99 | 23,183 | 12,819 | 7.21 | 10,261 | 143.16 | 348 | 260 |
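If you want to sanity-check the ops-to-latency relationship yourself, a quick Pearson correlation over these five samples (values taken straight from the table above) looks like this; five points is far too few to be rigorous, so treat it as illustration only.

```python
# Quick sanity check: do total NFS ops track write latency across the
# 4/3 samples above? (Pearson correlation; five points is only a hint.)
from statistics import correlation  # Python 3.10+

nfs_ops   = [7127, 13637, 31119, 32311, 23183]      # NFS Op/s column
write_lat = [0.63, 110.09, 341.18, 229.45, 143.16]  # Write Lat (ms) column

print(f"ops vs. write latency: r = {correlation(nfs_ops, write_lat):.2f}")
```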
And here is the data from Thursday 2/13:
| Start Time | CPU Busy (%) | NFS Op/s | Read Op/s | Read Lat (ms) | Write Op/s | Write Lat (ms) | Net Sent (MB/s) | Net Recv (MB/s) |
|---|---|---|---|---|---|---|---|---|
| 9:09p | 78 | 12,574 | 5,562 | 3.19 | 6,926 | 1.05 | 200 | 141 |
| 9:18p | 98 | 27,771 | 17,291 | 4.7 | 10,347 | 24.29 | 571 | 312 |
| 9:29p | 98 | 33,050 | 21,460 | 9.11 | 11,507 | 125.85 | 650 | 352 |
| 9:38p | 98 | 34,149 | 22,813 | 11.28 | 11,216 | 530.9 | 647 | 345 |
One thing that stands out in the data is a large, sudden increase in 64k+ IOPS. I've adjusted the table below to include a row for 64k IOPS and have highlighted the relevant statistic.
FAS6280 Maximum IOPS by Read/Write Mix:

| Avg IO Size | 100/0 | 75/25 | 50/50 | 25/75 | 0/100 |
|---|---|---|---|---|---|
| 64k | 61,000 | 43,000 | **32,000** | 26,000 | 22,000 |
| 32k | 68,000 | 48,000 | 36,500 | 30,000 | 25,000 |
| 24k | 74,000 | 51,000 | 39,500 | 31,500 | 27,000 |
| 16k | 80,000 | 56,500 | 43,000 | 36,500 | 30,500 |
| 8k | 85,000 | 63,000 | 50,000 | 41,500 | 36,000 |
| 4k | 90,000 | 66,000 | 54,000 | 45,000 | 40,000 |
The workload mix looks fine for most of the day but hits large-IOP-size peaks that fall outside our guidelines and cause some pain (38,000 IOPS at 1pm on 4/5, 45,000 IOPS at 11pm on 4/4, 37,000 IOPS at 11am on 4/3). I'd also mention that ~10% of the IO to this system is misaligned, which keeps us from achieving maximum performance ROI. Lastly, this system is achieving 65-85% dedupe ratios, which is fantastic space conservation but adds to the overall workload.
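To see how those peaks stack up against the ceilings in the guideline table above, here's a rough sketch; the choice of the 64k row at a 50/50 mix is an assumption for illustration, since the real workload drifts between size buckets and mixes.

```python
# Compare the observed IOPS peaks against the FAS6280 guideline table above.
# Only the 64k and 32k rows are reproduced here; the assumed mix and IO size
# are illustrative, not a measured characterization of the workload.
MAX_IOPS = {  # avg IO size -> {read % of the read/write mix: max IOPS}
    "64k": {100: 61_000, 75: 43_000, 50: 32_000, 25: 26_000, 0: 22_000},
    "32k": {100: 68_000, 75: 48_000, 50: 36_500, 25: 30_000, 0: 25_000},
}

def ceiling(io_size: str, read_pct: int) -> int:
    """Guideline ceiling from the nearest published read/write mix column."""
    mixes = MAX_IOPS[io_size]
    nearest = min(mixes, key=lambda m: abs(m - read_pct))
    return mixes[nearest]

peaks = {"4/5 1pm": 38_000, "4/4 11pm": 45_000, "4/3 11am": 37_000}
limit = ceiling("64k", 50)  # assume ~50/50 mix at a 64k average IO size
for when, iops in peaks.items():
    status = "over guideline" if iops > limit else "within guideline"
    print(f"{when}: {iops:,} observed vs {limit:,} ceiling -> {status}")
```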
As discussed yesterday, here are our options:
- Short-term steps:
  - Stagger workloads (Symantec, et al.)
  - Disable aggr snapshots (done)
  - Stagger dedupe
  - Case open on daytime snapshot-correlated latency (done)
  - Update Data ONTAP
- Long-term solutions:
  - Add new disk to the passive controller and balance the workload, or
  - Shift the workload to a different or new HA pair