IT engineering and a little bit of hacking: NetApp Experience: Shelf ADD => Disk Fail => Failover

During a shelf add last week, I experienced the biggest system outage I've ever encountered on NetApp equipment. We started seeing a few of these errors, which are normally spurious:

ses.exceptionShelfLog:info]: Retrieving Exception SES Shelf Log information on channel 0h ESH module A disk shelf ID 4.
ses.exceptionShelfLog:info]: Retrieving Exception SES Shelf Log information on channel 6b ESH module B disk shelf ID 5.

The first connection went smoothly, but when I unplugged the second connection from the existing loop, I started seeing some scary results. Here are the important messages, in order:

NOTE: Currently 14 disks are unowned. Use 'disk show -n' for additional information.

fci.link.break:error]: Link break detected on Fibre Channel adapter 0h.

disk.senseError:error]: Disk 7b.32: op 0x2a:1bc91268:0100 sector 0 SCSI:aborted command - (b 47 1 4e)

raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg0/7b.32 Shelf 2 Bay 0 will be tested.

diskown.errorReadingOwnership:warning]: error 46 (disk condition triggered maintenance testing) while reading ownership on disk 7b.32

disk.failmsg:error]: Disk 7b.32 (JXWGA8UM): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).

raid.rg.recons.missing:notice]: RAID group /aggr3_thin/plex0/rg0 is missing 1 disk(s).

Spare disk 0b.32 will be used to reconstruct one missing disk in RAID group /aggr3_thin/plex0/rg0.

raid.rg.recons.start:notice]: /aggr3_thin/plex0/rg0: starting reconstruction, using disk 0b.32

disk.senseError:error]: Disk 7b.41: op 0x2a:190ca400:0100 sector 0 SCSI:aborted command - (b 47 1 4e)

diskown.errorReadingOwnership:warning]: error 46 (disk condition triggered maintenance testing) while reading ownership on disk 7b.41

raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg1/7b.41 Shelf 2 Bay 9 will be tested

disk.senseError:error]: Disk 7b.37: op 0x2a:190ca500:0100 sector 0 SCSI:aborted command - (b 47 1 4e)

raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.37 Shelf 2 Bay 5 failed.

disk.senseError:error]: Disk 7b.40: op 0x2a:190ca500:0100 sector 0 SCSI:aborted command - (b 47 1 4e)

raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.40 Shelf 2 Bay 8 failed.

raid.vol.failed:CRITICAL]: Aggregate aggr3_thin: Failed due to multi-disk error

disk.failmsg:error]: Disk 7b.37 (JXWG6MLM): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).

disk.failmsg:error]: Disk 7b.40 (JXWEEB3M): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).

raid.disk.unload.done:info]: Unload of Disk 7b.37 Shelf 2 Bay 5 has completed successfully

raid.disk.unload.done:info]: Unload of Disk 7b.40 Shelf 2 Bay 8 has completed successfully

Waiting to be taken over. REBOOT in 17 seconds.

cf.fsm.takeover.mdp:ALERT]: Cluster monitor: takeover attempted after multi-disk failure on partner
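
The sequence is easier to follow when the RAID events are grouped by RAID group: rg0 lost one disk to maintenance testing and started reconstructing onto a spare, while rg1 had one disk pulled into testing and two more failed outright just before the aggregate was declared failed. The following is a minimal Python sketch of that grouping, not a NetApp tool. The function name `disks_out_by_raid_group` and the regular expression are assumptions made for illustration, and the pattern is fitted only to the trimmed message format pasted above.

```python
import re
from collections import defaultdict

# Four of the RAID messages from the sequence above, copied as pasted
# (timestamps and hostnames were already trimmed from the console output).
LOG = """\
raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg0/7b.32 Shelf 2 Bay 0 will be tested.
raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg1/7b.41 Shelf 2 Bay 9 will be tested
raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.37 Shelf 2 Bay 5 failed.
raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.40 Shelf 2 Bay 8 failed.
"""

# Event name, severity, then a disk path of the form /aggregate/plex/raidgroup/disk.
# Hypothetical pattern: it assumes the layout seen in the excerpt, nothing more.
PATTERN = re.compile(
    r"(?P<event>raid\.[\w.]+):\w+\]: .*?Disk "
    r"/(?P<aggr>[^/]+)/(?P<plex>[^/]+)/(?P<rg>[^/]+)/(?P<disk>\S+)"
)


def disks_out_by_raid_group(log_text):
    """Map each RAID group to the (disk, event) pairs that took a disk out of service."""
    out = defaultdict(list)
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            out[match["rg"]].append((match["disk"], match["event"]))
    return dict(out)


if __name__ == "__main__":
    for rg, events in sorted(disks_out_by_raid_group(LOG).items()):
        print(rg, len(events), events)
```

Run against the excerpt, this lists one affected disk in rg0 and three in rg1, which lines up with the "multi-disk error" reported for aggr3_thin.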

Long story short, this system had caused numerous issues in the past, and we replaced both a dead disk and an ESH module. After that, the system stabilized: "Since the ESH module replacement there were no new loop or link breaks noticed in subsequent ASUPs."
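
The sense data in the disk.failmsg lines is worth decoding, because every failed disk reported the same triple. In the standard SCSI tables, sense key 0x0b is ABORTED COMMAND, and ASC 0x47 with ASCQ 0x01 is listed as a data phase CRC error, which suggests data being corrupted in transit rather than a media fault on the disks themselves. That is at least consistent with a bad ESH module, though the logs alone do not prove it. Below is a small Python sketch of the decoding. The lookup tables are deliberately tiny subsets of the standard lists, and `decode_sense` is a hypothetical helper written for this message format only.

```python
import re

# Tiny subsets of the standard SCSI sense key and ASC/ASCQ tables:
# only the codes relevant to the messages in this post.
SENSE_KEYS = {0x0B: "ABORTED COMMAND"}
ASC_ASCQ = {
    (0x47, 0x00): "SCSI PARITY ERROR",
    (0x47, 0x01): "DATA PHASE CRC ERROR DETECTED",
}

# Matches the "command(0x0b), ASC(0x47), ASCQ(0x01)" portion of a disk.failmsg line.
SENSE_RE = re.compile(
    r"command\((0x[0-9a-f]+)\), ASC\((0x[0-9a-f]+)\), ASCQ\((0x[0-9a-f]+)\)",
    re.IGNORECASE,
)


def decode_sense(line):
    """Return (sense key text, ASC/ASCQ text) for a disk.failmsg line, or None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups())
    return (
        SENSE_KEYS.get(key, "unknown sense key"),
        ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ"),
    )


if __name__ == "__main__":
    sample = (
        "disk.failmsg:error]: Disk 7b.37 (JXWG6MLM): sense information: "
        "SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00)."
    )
    print(decode_sense(sample))
```

Codes outside the two small tables come back as "unknown", so the sketch never guesses at a meaning it does not have.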


Tuesday, June 5, 2012
