Tuesday, June 5, 2012

NetApp Experience: Shelf ADD => Disk Fail => Failover

During a shelf add last week, I experienced the biggest system outage I've ever encountered on NetApp equipment. We started seeing a few of these errors, which are normally spurious:

[ses.exceptionShelfLog:info]: Retrieving Exception SES Shelf Log information on channel 0h ESH module A disk shelf ID 4.
[ses.exceptionShelfLog:info]: Retrieving Exception SES Shelf Log information on channel 6b ESH module B disk shelf ID 5.

The first connection went smoothly, but when I unplugged the second connection from the existing loop, things got scary. Here are the important messages, in order:

NOTE: Currently 14 disks are unowned. Use 'disk show -n' for additional information.
[fci.link.break:error]: Link break detected on Fibre Channel adapter 0h.

[disk.senseError:error]: Disk 7b.32: op 0x2a:1bc91268:0100 sector 0 SCSI:aborted command - (b 47 1 4e)
[raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg0/7b.32 Shelf 2 Bay 0 will be tested.
[diskown.errorReadingOwnership:warning]: error 46 (disk condition triggered maintenance testing) while reading ownership on disk 7b.32
[disk.failmsg:error]: Disk 7b.32 (JXWGA8UM): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).
[raid.rg.recons.missing:notice]: RAID group /aggr3_thin/plex0/rg0 is missing 1 disk(s).
Spare disk 0b.32 will be used to reconstruct one missing disk in RAID group /aggr3_thin/plex0/rg0.
[raid.rg.recons.start:notice]: /aggr3_thin/plex0/rg0: starting reconstruction, using disk 0b.32
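
At this point, with unowned disks reported and a reconstruction underway, two quick sanity checks are worth running from the console (a sketch in 7-Mode syntax; exact output varies by ONTAP release):

filer> disk show -n
filer> aggr status -r

The first lists the unowned disks the NOTE above refers to; the second shows per-RAID-group status, including reconstruction progress on rg0.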

[disk.senseError:error]: Disk 7b.41: op 0x2a:190ca400:0100 sector 0 SCSI:aborted command - (b 47 1 4e)
[diskown.errorReadingOwnership:warning]: error 46 (disk condition triggered maintenance testing) while reading ownership on disk 7b.41
[raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg1/7b.41 Shelf 2 Bay 9 will be tested
[disk.senseError:error]: Disk 7b.37: op 0x2a:190ca500:0100 sector 0 SCSI:aborted command - (b 47 1 4e)
[raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.37 Shelf 2 Bay 5 failed.
[disk.senseError:error]: Disk 7b.40: op 0x2a:190ca500:0100 sector 0 SCSI:aborted command - (b 47 1 4e)
[raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.40 Shelf 2 Bay 8 failed.
[raid.vol.failed:CRITICAL]: Aggregate aggr3_thin: Failed due to multi-disk error
[disk.failmsg:error]: Disk 7b.37 (JXWG6MLM): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).
[disk.failmsg:error]: Disk 7b.40 (JXWEEB3M): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).
[raid.disk.unload.done:info]: Unload of Disk 7b.37 Shelf 2 Bay 5 has completed successfully
[raid.disk.unload.done:info]: Unload of Disk 7b.40 Shelf 2 Bay 8 has completed successfully

Waiting to be taken over.  REBOOT in 17 seconds.

[cf.fsm.takeover.mdp:ALERT]: Cluster monitor: takeover attempted after multi-disk failure on partner

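Once the partner head takes over, its console confirms the takeover state, and after the root cause is fixed and the downed head is healthy again, a giveback returns its resources (a sketch of the usual 7-Mode sequence, run from the surviving node):

filer2> cf status
filer2> cf giveback

Never issue the giveback until the failed node's storage is stable; handing the aggregates back onto a still-broken loop just repeats the outage.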
Long story short, this system had caused numerous issues in the past. We replaced both a dead disk and an ESH module, and after that the system stabilized. Per the support case: "Since the ESH module replacement there were no new loop or link breaks noticed in subsequent ASUPs."
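
Beyond waiting on clean ASUPs, there are a couple of 7-Mode checks that would confirm the loop was actually healthy after the ESH swap (a sketch; command availability depends on platform and ONTAP version):

filer> storage show disk -p
filer> fcadmin device_map

The first verifies every disk still has both an A-path and a B-path; the second maps shelves and loops per adapter, where a bypassed ESH port or missing shelf would stand out.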
