Tuesday, June 5, 2012

NetApp Experience: Shelf ADD => Disk Fail => Failover

During a shelf add last week, I experienced the biggest system outage I've ever encountered on NetApp equipment.  We started seeing a few of these messages, which are normally spurious:


ses.exceptionShelfLog:info]: Retrieving Exception SES Shelf Log information on channel 0h ESH module A disk shelf ID 4.
ses.exceptionShelfLog:info]: Retrieving Exception SES Shelf Log information on channel 6b ESH module B disk shelf ID 5.


The first connection went smoothly, but when I unplugged the second connection from the existing loop, I started seeing some scary results.  Here are the important messages, in order:

NOTE: Currently 14 disks are unowned. Use 'disk show -n' for additional information.
fci.link.break:error]: Link break detected on Fibre Channel adapter 0h.

disk.senseError:error]: Disk 7b.32: op 0x2a:1bc91268:0100 sector 0 SCSI:aborted command -  (b 47 1 4e)
raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg0/7b.32 Shelf 2 Bay 0  will be tested.
diskown.errorReadingOwnership:warning]: error 46 (disk condition triggered maintenance testing) while reading ownership on disk 7b.32
disk.failmsg:error]: Disk 7b.32 (JXWGA8UM): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).
raid.rg.recons.missing:notice]: RAID group /aggr3_thin/plex0/rg0 is missing 1 disk(s).
Spare disk 0b.32 will be used to reconstruct one missing disk in RAID group /aggr3_thin/plex0/rg0.
raid.rg.recons.start:notice]: /aggr3_thin/plex0/rg0: starting reconstruction, using disk 0b.32

disk.senseError:error]: Disk 7b.41: op 0x2a:190ca400:0100 sector 0 SCSI:aborted command - (b 47 1 4e)
diskown.errorReadingOwnership:warning]: error 46 (disk condition triggered maintenance testing) while reading ownership on disk 7b.41
raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg1/7b.41 Shelf 2 Bay 9  will be tested
disk.senseError:error]: Disk 7b.37: op 0x2a:190ca500:0100 sector 0 SCSI:aborted command - (b 47 1 4e)
raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.37 Shelf 2 Bay 5  failed.
disk.senseError:error]: Disk 7b.40: op 0x2a:190ca500:0100 sector 0 SCSI:aborted command - (b 47 1 4e)
raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.40 Shelf 2 Bay 8  failed.
raid.vol.failed:CRITICAL]: Aggregate aggr3_thin: Failed due to multi-disk error

disk.failmsg:error]: Disk 7b.37 (JXWG6MLM): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).
disk.failmsg:error]: Disk 7b.40 (JXWEEB3M): sense information: SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00).
raid.disk.unload.done:info]: Unload of Disk 7b.37 Shelf 2 Bay 5 has completed successfully
raid.disk.unload.done:info]: Unload of Disk 7b.40 Shelf 2 Bay 8  has completed successfully

Waiting to be taken over.  REBOOT in 17 seconds.

cf.fsm.takeover.mdp:ALERT]: Cluster monitor: takeover attempted after multi-disk failure on partner
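
To see why the aggregate went offline, it helps to count how many disks dropped out of each RAID group. The following is a rough sketch, not a NetApp tool: it pulls the RAID group path out of the messages above and tallies the disks that were either sent to maintenance testing or failed. The function names are made up for illustration, and the tolerance of two disks per RAID group is an assumption (it would be right for RAID-DP, but the RAID type isn't shown in these logs).

```python
import re
from collections import defaultdict

# Messages from the sequence above that take a disk out of a RAID group.
LOG = """\
raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg0/7b.32 Shelf 2 Bay 0  will be tested.
raid.disk.maint.start:notice]: Disk /aggr3_thin/plex0/rg1/7b.41 Shelf 2 Bay 9  will be tested
raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.37 Shelf 2 Bay 5  failed.
raid.config.filesystem.disk.failed:error]: File system Disk /aggr3_thin/plex0/rg1/7b.40 Shelf 2 Bay 8  failed.
"""

# Event names treated as "disk no longer serving data" (an assumption of this sketch).
EVENTS = ("raid.disk.maint.start", "raid.config.filesystem.disk.failed")

# Captures the RAID group path (".../rgN") and the disk name that follows it.
DISK_PATH = re.compile(r"Disk (/\S+/rg\d+)/(\S+)")


def disks_out_by_raid_group(log_text):
    """Map each RAID group path to the disks that dropped out of it, in log order."""
    out = defaultdict(list)
    for line in log_text.splitlines():
        if not any(event in line for event in EVENTS):
            continue
        match = DISK_PATH.search(line)
        if match:
            out[match.group(1)].append(match.group(2))
    return dict(out)


def groups_over_tolerance(log_text, tolerance=2):
    """Return RAID groups that lost more disks than the assumed tolerance."""
    tally = disks_out_by_raid_group(log_text)
    return [rg for rg, disks in tally.items() if len(disks) > tolerance]


if __name__ == "__main__":
    for rg, disks in disks_out_by_raid_group(LOG).items():
        print(rg, disks)
    print("over tolerance:", groups_over_tolerance(LOG))
```

Under that assumption, rg0 lost one disk and started reconstructing onto a spare, while rg1 lost three disks (7b.41, 7b.37, 7b.40) in quick succession, which matches the "Failed due to multi-disk error" message.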


Long story short, this system had caused numerous issues in the past, and we replaced both a dead disk and an ESH module.  After that, the system stabilized: "Since the ESH module replacement there were no new loop or link breaks noticed in subsequent ASUPs."
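
The sense data in the disk.failmsg lines points the same direction. In the standard SCSI tables, sense key 0x0b is ABORTED COMMAND and ASC/ASCQ 0x47/0x01 is "DATA PHASE CRC ERROR DETECTED", which describes a problem moving data over the link rather than a bad spot on the platter. Here is a small sketch that decodes one of those lines; the lookup tables hold only a handful of standard codes, and the function name is just for illustration.

```python
import re

# A few standard SCSI sense keys (subset only).
SENSE_KEYS = {
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
    0x0B: "ABORTED COMMAND",
}

# A few standard ASC/ASCQ pairs (subset only).
ASC_ASCQ = {
    (0x47, 0x00): "SCSI PARITY ERROR",
    (0x47, 0x01): "DATA PHASE CRC ERROR DETECTED",
}

# Matches the disk.failmsg format shown in the logs above.
FAILMSG = re.compile(
    r"Disk (\S+) \((\w+)\): sense information: SCSI:[^(]*"
    r"\((0x[0-9a-fA-F]+)\), ASC\((0x[0-9a-fA-F]+)\), ASCQ\((0x[0-9a-fA-F]+)\)"
)


def decode_failmsg(line):
    """Return disk, serial, and decoded sense fields, or None if the line doesn't match."""
    match = FAILMSG.search(line)
    if match is None:
        return None
    disk, serial = match.group(1), match.group(2)
    key, asc, ascq = (int(match.group(i), 16) for i in (3, 4, 5))
    return {
        "disk": disk,
        "serial": serial,
        "sense_key": SENSE_KEYS.get(key, "unknown"),
        "meaning": ASC_ASCQ.get((asc, ascq), "unknown"),
    }


if __name__ == "__main__":
    line = (
        "disk.failmsg:error]: Disk 7b.37 (JXWG6MLM): sense information: "
        "SCSI:aborted command(0x0b), ASC(0x47), ASCQ(0x01), FRU(0x00)."
    )
    print(decode_failmsg(line))
```

All three failed disks reported the same code, which fits a loop or module problem better than three drives dying at once.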
