Friday, May 6, 2011

NetApp Experience: Controller Panic

Was shadowing a shelf add recently and got to observe a pretty hairy situation.  Here's the rundown:

11:00pm
  1. A DS14mk4 shelf was added to a production HA FAS6080 running ONTAP 7.3.3.  The shelf was intended to be shelf 2 in the loop, but the shelf ID was still set to 1 when it was added.
  2. The controller that owned all the disks on that shelf panicked and failed over to its partner.
  3. The new shelf's ID was set to 2, the correct ID.
  4. The partner node assigned soft IDs to the new disks.
  5. The partner did not recognize all of the real shelf 1's disks, and began rebuilding.
  6. As many as 8 disks began rebuilding in bays 27, 28, or 29 of several loops.  It seems this client keeps their spares in the last couple of bays of the second disk shelf per loop, or the first bay in the third shelf.

Errors generated by adding the shelf with the wrong shelf ID, in chronological order:

fci.device.invalidate.soft.address   adapterName="0a" deviceName="0a.0 (0x04000000)" hardLoopId="17"
scsi.cmd.selectionTimeout            deviceType="Disk" deviceName="0a.17"
disk.ioFailed                        deviceName="0a.17"
scsi.cmd.noMorePaths                 deviceType="Disk" deviceName="0a.22"
scsi.cmd.noMorePaths                 deviceType="Disk" deviceName="0a.23"
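To see which physical shelf those device names point at, here is a rough Python sketch. It assumes the DS14-style FC convention where a disk's loop ID is 16 times the shelf ID plus the bay number; that mapping is my assumption and is not stated in the logs, and the `locate` helper and its regex are made up for illustration.

```python
import re

# Assumption: DS14-style FC addressing, loop ID = 16 * shelf ID + bay.
# Matches names like deviceName="0a.17" (adapter "0a", loop ID 17).
DEVICE_RE = re.compile(r'deviceName="(?P<adapter>[0-9a-z]+)\.(?P<loop_id>\d+)"')


def locate(line):
    """Return (adapter, shelf_id, bay) for a log line, or None if no plain disk name is found."""
    match = DEVICE_RE.search(line)
    if match is None:
        return None
    shelf_id, bay = divmod(int(match.group("loop_id")), 16)
    return match.group("adapter"), shelf_id, bay


# The five messages from the incident, as listed above.
LOG = [
    'fci.device.invalidate.soft.address adapterName="0a" deviceName="0a.0 (0x04000000)" hardLoopId="17"',
    'scsi.cmd.selectionTimeout deviceType="Disk" deviceName="0a.17"',
    'disk.ioFailed deviceName="0a.17"',
    'scsi.cmd.noMorePaths deviceType="Disk" deviceName="0a.22"',
    'scsi.cmd.noMorePaths deviceType="Disk" deviceName="0a.23"',
]

for line in LOG:
    event = line.split()[0]
    print(event, locate(line))
```

Under that assumption, every disk named in the errors (0a.17, 0a.22, 0a.23) falls on shelf ID 1, which is consistent with two shelves claiming the same ID on the loop. The first message is skipped because its device name is not a plain adapter.loop-ID pair.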


03:00am:  Disks completed rebuilding.  8 disks on real shelf 1 still not being recognized.
04:00am:  FSE arrives onsite.
04:30am:  20-minute outage action plan developed:
  • Shut down all systems accessing the data.
  • Disable protocols.
  • Halt both controllers (take them offline).
  • Reboot disk shelves 1 and 2.
  • Boot up controllers.
06:00am:  No action taken.  Customer and NetApp decided to let the system stay stable into production hours and address it the next night.

10:00pm:  Action plan started (shut off systems accessing the data, etc).
10:19pm:  Both controllers shut down.
10:28pm:  Both controllers up and functioning normally.

Notes:
  1. No outage occurred until the controlled failback.
  2. No data loss occurred.
  3. We have no insight into the effect on performance.

Takeaways:
  1. Having lots of spares can pay off.
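  2. Make sure no RAID group (RG) has more than two of its disks on any one shelf.  With RAID-DP, that keeps the loss of a whole shelf within the two disk failures a group can survive.
  3. Human error is much more likely than a mechanical failure or a software bug to cause a major disruption.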
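
The second takeaway is easy to check mechanically. Below is a minimal Python sketch, not anything from the client's environment: the RAID group layout is invented, the helper names are hypothetical, and it again assumes the DS14-style convention that loop ID = 16 * shelf ID + bay, with the adapter name standing in for the loop.

```python
from collections import Counter


def shelf_of(disk_name):
    """Map a name like "0a.17" to (adapter, shelf_id), assuming loop ID = 16 * shelf ID + bay."""
    adapter, loop_id = disk_name.split(".")
    return adapter, int(loop_id) // 16


def overloaded_shelves(raid_groups, limit=2):
    """Return {rg_name: {(adapter, shelf_id): count}} for shelves holding more than `limit` disks of one RG."""
    problems = {}
    for rg_name, disks in raid_groups.items():
        counts = Counter(shelf_of(disk) for disk in disks)
        too_many = {shelf: n for shelf, n in counts.items() if n > limit}
        if too_many:
            problems[rg_name] = too_many
    return problems


# Hypothetical layout, for illustration only.
layout = {
    "rg0": ["0a.16", "0a.17", "0a.32", "0a.33", "0a.48"],  # at most 2 disks per shelf
    "rg1": ["0a.18", "0a.19", "0a.20", "0a.34"],           # 3 disks on shelf 1
}

print(overloaded_shelves(layout))
```

With this made-up layout, only rg1 is flagged, because three of its disks sit on shelf 1 of adapter 0a.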
