Wednesday, September 12, 2012

Non-Disruptive 1-Chassis-to-2-Chassis Transition


Can you non-disruptively transition a clustered single-chassis system into two chassis? We folded a 7.3.6 => 8.1.1 upgrade into the test so we could try cf takeover -n, which is used when there is a version mismatch to force a takeover when the other controller halts.

Here’s a timeline of what we tried (on a 3240 in the lab) along with the results:
1.  Upgrade B
a.  Update B to 8.1.1, fail over to A
b.  Move B to the new chassis and connect the interconnect cable
c.  Set B's boolean to false
d.  cf giveback -f
2.  Upgrade A
a.  Update A to 8.1.1
b.  cf takeover -n failed because the interconnect was determined to be down, so B couldn't see A halting. *1
c.  A is halted at this point
d.  cf takeover -f failed because of the version mismatch. *2
e.  cf forcetakeover succeeded
f.  Set A's boolean to false
g.  cf giveback failed because the interconnect was determined to be down. *3
h.  cf giveback -f succeeded
3.  All appears stable, and the interconnect is up.
Notes:
*1 “Partner is not UP, NDU Takeover Terminated”
*2 “cf: takeover cannot be performed because of reason (interconnect error)”
*3 “cf monitor all” attached

What we found out:
There is a Boolean environment variable that tells each controller whether it’s sharing the chassis with another controller, which is called a “CC” configuration (true = yes, CC config). The cool thing about this variable is that ONTAP will automatically set it to the correct value in two cases, and leave it alone otherwise:
  • Any time the system is in CC configuration, ONTAP will set the correct value itself (true).
  • Any time the system is in CI configuration (i.e. an IOXM is present), ONTAP will set the correct value itself (false).
  • For all other configurations, ONTAP will not change the value.
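
The three rules above can be written as a small decision function. This is only a sketch of the behavior as we understood it, not ONTAP code: the names ChassisConfig and resolve_shared_chassis_flag are made up for illustration, and the real variable name is deliberately left out.

```python
from enum import Enum
from typing import Optional


class ChassisConfig(Enum):
    """Hypothetical labels for the three situations described above."""
    CC = "two controllers share one chassis"
    CI = "controller shares the chassis with an IOXM"
    OTHER = "anything else"


def resolve_shared_chassis_flag(config: ChassisConfig,
                                current: Optional[bool]) -> Optional[bool]:
    """Return the flag value after boot, per the three rules above.

    `current` is whatever value was stored before boot (None if unset).
    """
    if config is ChassisConfig.CC:
        return True       # rule 1: ONTAP sets true itself
    if config is ChassisConfig.CI:
        return False      # rule 2: ONTAP sets false itself
    return current        # rule 3: ONTAP leaves the stored value alone


if __name__ == "__main__":
    # A controller that still carries true from its shared-chassis days
    # keeps that value if it lands in an "other" configuration.
    print(resolve_shared_chassis_flag(ChassisConfig.OTHER, True))
```

Under rule 3, a stale true survives the move, which is presumably why the steps in this post set the boolean to false by hand on each controller.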


Conclusion: The upgrade and cf takeover -n didn't help. There is still a viable path for a non-disruptive plan, but it requires a precisely timed halt and cf forcetakeover, which isn’t without risk. Action plan below:
Part 1:
  • Fail over to A
  • Move B to new chassis and connect interconnect
  • Set B's boolean to false
  • cf giveback -f

Part 2:
  • Halt A, then run cf forcetakeover on B as soon as A drops to the LOADER prompt
  • Set A's boolean to false
  • Boot A. Interconnect should be up when node reaches 'Waiting for giveback'
  • cf giveback -f
  • cf should be enabled
Note: There is also a Boolean environment variable that fools the controller into thinking it is in “CI” configuration. It’s an effective override, but it wasn't useful here.
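
For anyone who wants to carry the plan into a maintenance window, here is a minimal sketch that holds the two parts as data and prints them as a checklist, flagging the one step where timing matters. Nothing here talks to a filer; Step, PLAN and render are hypothetical names, and the step text is copied from the plan above.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    part: int
    action: str
    timing_critical: bool = False  # True for the halt + forcetakeover step


PLAN: List[Step] = [
    Step(1, "Fail over to A"),
    Step(1, "Move B to new chassis and connect interconnect"),
    Step(1, "Set B's boolean to false"),
    Step(1, "cf giveback -f"),
    Step(2, "Halt A, cf forcetakeover as soon as A drops to LOADER prompt",
         timing_critical=True),
    Step(2, "Set A's boolean to false"),
    Step(2, "Boot A; interconnect should be up at 'Waiting for giveback'"),
    Step(2, "cf giveback -f"),
    Step(2, "cf should be enabled"),
]


def render(plan: List[Step]) -> str:
    """Format the plan as a printable checklist grouped by part."""
    lines: List[str] = []
    current_part = None
    for step in plan:
        if step.part != current_part:
            current_part = step.part
            lines.append(f"Part {current_part}:")
        marker = "!" if step.timing_critical else " "
        lines.append(f"  [{marker}] {step.action}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(PLAN))
```

The "!" marker is just a reminder that the halt and forcetakeover have to happen back to back, which is the risky part of the plan.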
