Thursday, October 27, 2011

NetApp Experience: Shelf Add => Disk Bypass

During a shelf add last night, we ran into another hairy situation.  It turns out we didn't connect the new shelf smoothly enough when sliding the SFP into the port, which caused us to see this:

[Filer: esh.bypass.err.disk:error]: Disk 7b.50 on channels 7b/0d disk shelf ID 3 ESH A bay 2 Bypassed due to excessive port oscillations

[Filer: ses.drive.missingFromLoopMap:CRITICAL]: On adapter 0d, the following device(s) have not taken expected addresses: 0d.57 (shelf 3 bay 9), 0d.58 (shelf 3 bay 10), 0d.59 (shelf 3 bay 11), 0d.61 (shelf 3 bay 13), 0d.67 (shelf 4 bay 3), 0d.70 (shelf 4 bay 6), 0d.74 (shelf 4 bay 10), 0d.75 (shelf 4 bay 11)

[Filer: shm.bypassed.disk.fail.disabled:error]: shm: Disk bypass check has been disabled due to multiple bypassed disks on host bus adapter 0d, shelf 3.

[Filer: shm.bypassed.disk.fail.disabled:error]: shm: Disk bypass check has been disabled due to multiple bypassed disks on host bus adapter 0d, shelf 4.

[Filer: ses.exceptionShelfLog:info]: Retrieving Exception SES Shelf Log information on channel 0d ESH module A disk shelf ID 3.

In fcadmin device_map, the bypassed disks showed up under both loops:

Loop 0d
Loop 7b

Note that each loop saw a different number of bypassed disks.  Meanwhile, sysconfig -r, disk show -n, and vol status -f all came back normal.  A little backstory here: ONTAP bypasses disks because in certain scenarios a single misbehaving disk can lock up an entire FC loop (read here for more info on this).  Bypassing is not the same thing as failing a disk: in various situations the disk is simply ignored by the system rather than marked as failed.
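
If you want a quick checklist, these are the same commands mentioned above, with what we were looking for in each (the parentheses are just our notes, not part of the command):

sysconfig -r   (any failed, reconstructing, or missing disks in the RAID layout?)
disk show -n   (any disks that have lost ownership?)
vol status -f  (anything sitting in the failed disk list?)
fcadmin device_map   (which bays, if any, are bypassed or missing from the loop map?)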

This thread indicated to us that the fix would likely be slowly reseating the disks.  You have to respect the filer: pulling and pushing a ton of disks in quick succession can have unexpected consequences, so wait at least 90 seconds between each action.  Pull, wait 90s, reseat, wait 90s, pull the next one, and so on.
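
One suggestion of ours (not from the thread): after each reseat and its 90-second wait, confirm the drive actually rejoined the loop before touching the next one:

fcadmin device_map   (the reseated bay should be back at its expected address, not bypassed or missing)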


An example of those unexpected consequences is below: one disk reacted poorly to being reseated and failed.  When the system re-scanned the disk after it was pushed back in, we saw this:

[Filer: disk.init.failureBytes:error]: Disk 0d.70 failed due to failure byte setting

We attempted to reseat the disk again, with the same result.  The disk didn't show up in vol status -f, and trying to unfail it had no effect either.  Here's how we fixed it, waiting 90s between each step:
1) pull failed disk
2) pull another disk
3) swap failed disk into other slot
4) swap other disk into failed disk's slot
5) disk show 
6) priv set advanced 
7) disk unfail 7b.71 (this is the slot the failed disk was in).
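
Once the unfail goes through, a few follow-up checks (substitute your own disk name; this is just what we'd suggest, not an official procedure) should confirm the disk is healthy again:

priv set       (drop back out of advanced mode)
vol status -f  (the failed disk list should be empty again)
vol status -s  (the disk should now show up as a spare)
sysconfig -r   (one last sanity check of the RAID layout)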
