Thursday, September 29, 2011

NetApp Insights: Shelf Shutdown

Now that we know we can perform a shelf reboot live, we got a bit adventurous.

The question we were trying to answer was "Could we replace or remove a shelf on a live system without causing downtime?"  I used a 3160 cluster in the lab with 4 DS14s in a loop, slowly failed all the disks in shelf 3, and removed ownership on those disks.  At that point, I could shut down or unplug that shelf at will, and neither controller complained beyond noting that it was transitioning to single-path.
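
The precondition in that procedure (every disk in the shelf is failed and no longer owned before the shelf is powered off) can be written down as a small check. This is only a sketch in Python, not a NetApp tool: the `Disk` record, the `removal_blockers` function, the disk names, and the owner name `nodeA` are all made up for illustration, and the disk inventory would have to be filled in by hand from whatever the controllers report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    """Hypothetical record for one disk; not a NetApp data structure."""
    name: str
    shelf: int
    failed: bool = False
    owner: Optional[str] = None  # None means ownership has been removed


def removal_blockers(disks: List[Disk], shelf: int) -> List[str]:
    """List the reasons a shelf is not yet ready to be powered off.

    An empty list means every disk in the shelf is failed and unowned,
    which is the state the shelf was in before it was unplugged.
    """
    reasons = []
    for disk in disks:
        if disk.shelf != shelf:
            continue  # disks in other shelves are not touched
        if not disk.failed:
            reasons.append(f"{disk.name}: not failed")
        if disk.owner is not None:
            reasons.append(f"{disk.name}: still owned by {disk.owner}")
    return reasons


if __name__ == "__main__":
    # Illustrative inventory only; names and owners are invented.
    inventory = [
        Disk("0a.48", shelf=3, failed=True, owner=None),
        Disk("0a.49", shelf=3, failed=False, owner="nodeA"),
        Disk("0a.16", shelf=1, failed=False, owner="nodeA"),
    ]
    for reason in removal_blockers(inventory, shelf=3):
        print(reason)
```

The function only reports blockers for the target shelf, so healthy, owned disks in the other shelves of the loop do not show up in the output.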

I doubt NGS will ever give the plan their full blessing, but it's good to know that it's OK from a technical standpoint.

Update 1: I also successfully swapped out a shelf chassis in this manner in the lab.  The controllers were totally ok with a new serial number!  No issues that I could find.

Update 2: NGS did in fact OK this action plan twice, but later backed out completely.  There's concern that the system will keep the shelf registered somewhere in the OS.  A possible solution is to perform a failover/giveback for each node after the shelf removal, since failover/giveback includes a reboot.

