Friday, September 9, 2011

NetApp Insights: NDR Shelf Reboot

I got to watch a NetApp expert at work yesterday as he tested a pretty cool capability that I hadn't heard of before. In ONTAP 7.3.2, DS14 shelves in certain hardware configurations (read the KB below) let you suspend I/O to the shelf for a limited time so it can be rebooted without the system panicking.

The basic idea is this: normally, if a shelf disappears off the loop, the system catches the error and panics. With this capability, the system goes into a mode where it tolerates the loss for a limited time by queuing or suspending traffic to the affected disks. In practice, you will see affected volumes suspend traffic for a short period. After the shelf reboot is complete, entering the power_cycle shelf completed command takes the system out of that mode and returns it to normal error handling.
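To make the mode switch concrete, here's a toy model in Python. This is only my mental picture of the behavior described above, not how ONTAP implements it: the class and method names are made up, and it ignores the time limit entirely.

```python
import enum


class Mode(enum.Enum):
    NORMAL = "normal"          # shelf loss is treated as a fatal error
    TOLERATING = "tolerating"  # shelf loss is expected; traffic is held


class ShelfLossModel:
    """Hypothetical illustration of the two modes; not ONTAP code."""

    def __init__(self):
        self.mode = Mode.NORMAL

    def power_cycle_start(self):
        # Corresponds to issuing 'power_cycle shelf start'.
        self.mode = Mode.TOLERATING

    def power_cycle_completed(self):
        # Corresponds to issuing 'power_cycle shelf completed'.
        self.mode = Mode.NORMAL

    def shelf_disappears(self):
        # What the system does when a shelf drops off the loop.
        if self.mode is Mode.TOLERATING:
            return "suspend traffic"
        return "panic"


model = ShelfLossModel()
print(model.shelf_disappears())   # normal mode
model.power_cycle_start()
print(model.shelf_disappears())   # tolerance mode
model.power_cycle_completed()
print(model.shelf_disappears())   # back to normal mode
```

The model deliberately says nothing about which shelves are covered while in tolerance mode; that is still an open question.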

For certain configurations the shelf will actually reboot automatically; for older hardware/software combos, the system will give you 60 minutes to manually shut off the power to the shelf. The specs say to expect up to a 60-second suspension in traffic. In our tests, the automatic reboot took 11 seconds and the manual one took up to 45 seconds.
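A quick way to put those numbers side by side is to work out the headroom against the 60-second figure. This is just arithmetic on the values above; the client timeout in the last check is a made-up number for illustration, not something we measured.

```python
SPEC_MAX_SUSPENSION_S = 60  # upper bound quoted in the specs

# Suspension times we saw in the lab, in seconds.
observed = {"automatic reboot": 11, "manual power cycle": 45}


def headroom(observed_s, limit_s=SPEC_MAX_SUSPENSION_S):
    """Seconds of margin left before the quoted limit is reached."""
    return limit_s - observed_s


def outlasts_suspension(client_timeout_s, suspension_s=SPEC_MAX_SUSPENSION_S):
    """True if a client's I/O timeout is longer than the suspension."""
    return client_timeout_s > suspension_s


for label, seconds in observed.items():
    print(f"{label}: {seconds}s suspended, {headroom(seconds)}s headroom")

# Hypothetical client with a 30-second I/O timeout (assumed value).
print(outlasts_suspension(30, suspension_s=11))
print(outlasts_suspension(30, suspension_s=45))
```

The point of the second function is that the worst case to plan for is the quoted 60 seconds, not the faster times we happened to see.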

Here's an example of the command that reboots shelf 3 on loop 6a:

storage power_cycle shelf start -f -c 6a -s 3

And here's the syntax:

power_cycle shelf -h
power_cycle shelf start [-f] -c <channel name> [-s <shelf id>]
power_cycle shelf completed 

Attempts to power-cycle a selected shelf in a channel or all the shelves in a channel.

'power_cycle shelf completed' command must be used, as and when instructed by the 'power_cycle shelf start' command.
-f    do not ask for shelf power-cycle confirmation 
-c    if option -s is not specified power-cycle all shelves connected to the specified channel. if option -s is specified, power-cycle shelf id on specified channel.
-h    display this help and exit.
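If you wanted to script this, a small helper could assemble the command string from the options described in the help text. This is a sketch only: the function name is mine, the validation is a guess at what's sensible, and it just builds the string rather than running anything against a filer.

```python
def build_power_cycle_start(channel, shelf_id=None, force=False):
    """Build a 'storage power_cycle shelf start' command line.

    channel  -- channel/loop name, e.g. '6a' (required, passed to -c)
    shelf_id -- shelf ID for -s; if omitted, the command targets all
                shelves on the channel
    force    -- add -f to skip the confirmation prompt
    """
    if not channel or any(ch.isspace() for ch in channel):
        raise ValueError("channel must be a non-empty name such as '6a'")
    if shelf_id is not None and int(shelf_id) < 0:
        raise ValueError("shelf_id must not be negative")

    parts = ["storage", "power_cycle", "shelf", "start"]
    if force:
        parts.append("-f")
    parts += ["-c", channel]
    if shelf_id is not None:
        parts += ["-s", str(shelf_id)]
    return " ".join(parts)


# Same command as the example above: shelf 3 on loop 6a, no prompt.
print(build_power_cycle_start("6a", shelf_id=3, force=True))

# Without a shelf ID, every shelf on the channel is targeted.
print(build_power_cycle_start("6a"))
```

Leaving force off by default seems like the safer choice here, since omitting -s power-cycles every shelf on the channel.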

One idea we tested in the lab was using this to change the shelf ID while the system is still online. The shelf we rebooted had the mailbox disks on it, which caused a panic and a failover. This may still be possible under some conditions; I'll update as we figure this out.

Some ideas I want to test out:
  • What happens if a shelf other than the one you specified in the command goes offline?  Is the system targeted in its tolerance of shelf loss, or does its tolerance extend across all shelves?
  • For setting the shelf ID:
    • What difference does it make if none of the disks are owned/are spares?
    • What if none of the disks in the shelf are mailbox disks?

KB article: id=3012745

Note: This capability is not available for DS4243 shelves.
