Thursday, September 22, 2011

NetApp Experience: Shelf Add => Disk Fail

One of my practices when performing a shelf add is to wait between each step, particularly between unplugging and re-connecting any cables.  My reasoning has been that the system should be given time to settle into its new circumstances; specifically, the controller needs to recognize which paths it can now use to reach the disks.

Digging deeper, one thing I recently learned is that each disk has two communication ports (referred to as A and B) to the shelf modules, and paths are negotiated from the disk, through a shelf module, to a fiber port on the controller.  For example, disk 23 port A could connect through shelf module B to port 2c, while disk 23 port B connects through shelf module A to port 1b.
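
A quick way to see this in practice, and what I check while waiting between cabling steps (this assumes a 7-Mode system like ours, and is a sketch from memory rather than a procedure):

Controller> storage show disk -p
(lists the primary and secondary path for every disk, so 1d.18 and 2b.18 show up as two routes to the same drive)
Controller> fcadmin device_map
(shows the loop map per FC adapter, handy for confirming both shelf modules are still visible before pulling the next cable)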

All of that is important to understanding the serious issue that recently failed two disks in a production cluster.  A single shelf (DS14mk2, 750GB SATA) was connected MPHA (multipath HA) to a clustered pair with this configuration:

disk 1d.18: disk port A to shelf module B to port 1d
disk 2b.22: disk port B to shelf module A to port 2b

After unplugging the cable from 1d to shelf module B, there was a 17-second delay and then this:

Cluster Notification mail sent: Cluster Notification from CONTROLLER (DISK CONFIGURATION ERROR) WARNING
Controller> [scsi.path.excessiveErrors:error]: Excessive errors encountered by adapter 2b on disk device 2b.18.
Controller> [scsi.cmd.transportError:error]: Disk device 2b.22: Transport error during execution of command: HA status 0x9: cdb 0x28:354a07b8:0048.
Controller> [raid.config.filesystem.disk.not.responding:error]: File system Disk /aggr2/plex0/rg0/2b.22 Shelf 1 Bay 6 [NETAPP   X268]  is not responding.
Controller> [scsi.cmd.transportError:error]: Disk device 2b.18: Transport error during execution of command: HA status 0x9: cdb 0x28:4f5e9748:0048.
Controller> [disk.failmsg:error]: Disk 2b.22: command timed out.
Controller> [raid.rg.recons.missing:notice]: RAID group /aggr2/plex0/rg0 is missing 1 disk(s).
Controller> [raid.rg.recons.info:notice]: Spare disk 2b.27 will be used to reconstruct one missing disk in RAID group /aggr2/plex0/rg0.

Controller> [diskown.errorReadingOwnership:warning]: error 23 (adapter error prevents command from being sent to device) while reading ownership on disk 2b.18

Analysis:
These two disks failed as a result of an HBA issue last night.  When a path is disconnected, any disks owned over that path are designed to fail over to the redundant path.  When we disconnected port 1d, the HBA in slot 2 produced errors that halted this normal renegotiation for two of the disks.  Because those disks could not operate on the redundant path, the system considered them failed and began reconstructing their data onto spare disks.  When this happened, we halted work to investigate and remediate.  We'll probably just RMA the disks to reduce variables.
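
For reference, and again assuming 7-Mode, this is roughly how we kept an eye on the reconstructions and the remaining spares while deciding whether to continue:

Controller> aggr status -r
(shows each RAID group, which disks are reconstructing, and percent complete)
Controller> aggr status -s
(lists the hot spares still available)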

NetApp Support recommendation: Re-seat this HBA, which would require a failover/giveback to perform.  Another option would be to replace the HBA (which is what we'll probably do).
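
For anyone curious, the reseat would look roughly like this on a 7-Mode pair (a sketch only, not a procedure):

Partner> cf status
(confirm the pair is healthy and takeover is possible)
Partner> cf takeover
(the partner serves data while the affected head is down)
Controller> halt
(then power off, reseat the HBA in slot 2, and power back on)
Partner> cf giveback
(hand service back once the affected controller is sitting at the waiting-for-giveback prompt)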

Update: NGS (NetApp Support) has changed its assessment and now thinks this is a disk firmware issue.  The disk firmware was a couple of years back-rev'd, and their explanation is that SCSI errors caused by the firmware pile up over time, eventually crossing a threshold that causes an HBA to consider the disk incommunicable.  There's no warning on the system that the HBA can't talk to that disk; all the traffic is simply routed through the redundant path.
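
A rough sketch of how to check and push disk firmware on 7-Mode (the exact package and revision are whatever NGS supplies, and I'm going from memory on the staging path):

Controller> storage show disk -a
(the Rev field shows each drive's current firmware revision)
Controller> disk_fw_update
(pushes any newer firmware staged in /etc/disk_fw on the root volume out to the drives)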

In this case, two disks were in this situation, and when I unplugged the return path (the path they were active on), they tried to fail over to the other path and could not.  NGS believes this was a pure-chance, struck-by-lightning situation.

I'll post the bug report on this soon: the gist of it is that between 40C and 50C, a latching mechanism can get "stuck" and error out, but will quickly recover.  I'm skeptical of this because the highest temperature observed in this shelf was 36C.
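
For what it's worth, the 36C reading came from the shelf's own sensors; on 7-Mode something like this reports them (going from memory on the exact sub-command):

Controller> environment shelf
(reports the temperature sensors and thresholds for each attached shelf)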

2nd Update: As best I can tell, the disk firmware update did the trick.  We went through with shelf adds last night without seeing the same behavior.  We did, however, see what we believe to be a separate issue.
