Friday, October 7, 2011

NetApp Experience: Bad Slot

Some interesting things have happened lately. I had a shelf add that kicked out a ridiculous number of errors for one disk on the new shelf:

disk.senseError:error]: Disk 2d.53: op 0x28:0000a3e8:0018 sector 0 SCSI:hardware error - (4 44 0 3)

diskown.RescanMessageFailed:warning]: Could not send rescan message to eg-naslowpc-h01. Please type disk show on the console for it to scan the newly inserted disks.

diskown.errorReadingOwnership:warning]: error 46 (disk condition triggered maintenance testing) while reading ownership on disk 2d.53

Disk 2d.53: op 0x28:0000a3f0:0008 sector 0 SCSI:hardware error - (4 44 0 3)
diskown.AutoAssignProblem:warning]: Auto-assign failed for disk 2d.53

The weird thing was that the messages kept looping instead of the system simply failing the disk.  We swapped a new disk into that slot and moved the original disk into a different slot to see whether the disk was bad. It turns out the slot is bad, not the disk.
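If you save the console output to a file, a quick tally shows how often each disk repeats the same sense error. Here is a minimal Python sketch of that idea. The regular expression is inferred only from the sample lines quoted above, not from any NetApp format specification, and the function name count_sense_errors is my own, so treat it as a starting point rather than a proper parser.

```python
import re
from collections import Counter

# Pattern inferred from the sample messages in this post only (an assumption,
# not a documented NetApp log format).
SENSE_RE = re.compile(
    r"Disk (?P<disk>\S+): op (?P<op>\S+) sector (?P<sector>\d+) "
    r"SCSI:(?P<text>.+?) - \((?P<codes>[0-9a-fA-F ]+)\)"
)

def count_sense_errors(lines):
    """Count sense-error messages per (disk, sense text, code tuple)."""
    counts = Counter()
    for line in lines:
        m = SENSE_RE.search(line)
        if m:
            counts[(m["disk"], m["text"], m["codes"])] += 1
    return counts

# Sample lines copied from the messages above.
sample = [
    "disk.senseError:error]: Disk 2d.53: op 0x28:0000a3e8:0018 sector 0 SCSI:hardware error - (4 44 0 3)",
    "Disk 2d.53: op 0x28:0000a3f0:0008 sector 0 SCSI:hardware error - (4 44 0 3)",
    "diskown.AutoAssignProblem:warning]: Auto-assign failed for disk 2d.53",
]

for key, n in count_sense_errors(sample).items():
    print(key, n)
```

A disk ID that racks up the same error again and again is the looping pattern described above. The count alone does not tell you whether the disk or the slot is at fault, which is why the swap test was still needed.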

We also tried reseating Module B on that shelf.  NetApp Support informed me that "Module A handles communication to the even numbered disks by default, and Module B the odd disks."  I don't think this is true.

We're working with the customer to find a good resolution for this.  Since downtime is hard to schedule, we may try to swap out the shelf chassis while the system is running.  We'll see :-)

