Friday, August 5, 2011

NetApp Training Brain Dump: Clusters


Doing my best to translate tech-speak into common sense, one day at a time.


In NetApp, a cluster is two controllers that are both capable of accessing any disk in the system. When data is sent to a particular controller to be written to disk, that data lands in the local cache (NVRAM) and is then mirrored to the partner controller's cache. The purpose is failover: if one controller goes down, the other already holds the pending writes, so it can 100% emulate the failed one and not miss a beat.

This is called a "takeover". If one partner "panics" (fails), the other controller will take over its disks, its IP addresses, and its traffic. Pretty cool. It knows which IP addresses to spoof because when you set it up, you put the partner addresses in the /etc/rc file. You typically want no more than 50% utilization on either controller, so that in the case of a failover, the surviving controller can handle the combined traffic of both.
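The 50% rule of thumb is just arithmetic, and a quick sketch makes it concrete. This is only an illustration: the function name is made up, and it assumes the two controllers are identical and that their utilization simply adds up after a takeover, which real workloads won't do exactly.

```python
def survivor_can_absorb(util_a: float, util_b: float, capacity: float = 1.0) -> bool:
    """Return True if one controller could carry both workloads after a takeover.

    Assumption: utilization is a fraction of one controller's capacity
    (0.0 to 1.0) and the two loads are simply additive.
    """
    return util_a + util_b <= capacity


# Two controllers at 45% and 40%: the survivor would sit at about 85%.
print(survivor_can_absorb(0.45, 0.40))

# One controller at 70%: the combined load exceeds a single controller.
print(survivor_can_absorb(0.70, 0.45))
```

Keeping each side at or under 50% is the simplest way to guarantee the sum never exceeds what a single controller can do.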

When you are confident the failed controller is back up and operational, you can initiate a "giveback," in which the controller coming back online re-syncs with its partner's cache, and then resumes owning its disks, handling its traffic, and answering on its IPs. Givebacks take 1-2 minutes or so, during which the taken-over system is unavailable, and there are complications for people accessing files via CIFS. The giveback command is issued from the partner that took over the down controller.
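To keep the takeover/giveback cycle straight in my head, here is a toy model of it. Everything in it (the HAPair class, the controller names, the served_by map) is hypothetical and for illustration only; it is not how ONTAP implements anything, it just tracks who is serving whose disks and IPs.

```python
class HAPair:
    """Toy model of a two-controller pair. Names are illustrative, not ONTAP APIs."""

    def __init__(self, a: str = "ctrl-a", b: str = "ctrl-b"):
        self.nodes = (a, b)
        # served_by[x] is the controller currently serving x's disks and IPs.
        self.served_by = {a: a, b: b}

    def partner_of(self, node: str) -> str:
        if node not in self.nodes:
            raise ValueError(f"unknown controller: {node}")
        a, b = self.nodes
        return b if node == a else a

    def takeover(self, survivor: str) -> None:
        """The survivor starts serving its partner's identity as well as its own."""
        failed = self.partner_of(survivor)
        self.served_by[failed] = survivor

    def giveback(self, survivor: str) -> None:
        """Issued from the survivor: hand the partner's identity back to it."""
        failed = self.partner_of(survivor)
        if self.served_by[failed] != survivor:
            raise RuntimeError("nothing to give back")
        self.served_by[failed] = failed


pair = HAPair()
pair.takeover("ctrl-a")    # ctrl-b went down; ctrl-a now answers for both
print(pair.served_by)
pair.giveback("ctrl-a")    # run from the controller that took over
print(pair.served_by)
```

Note that giveback is called on the survivor, matching the point above that the command is issued from the partner that took over.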

There are a number of options you can configure to handle this behavior. You can:

  • Alter how file sessions are terminated in CIFS before a giveback, including warning the user.
  • Delay/disable automatic givebacks.
  • Have ONTAP disable long-running operations before giveback.
  • Stop the up controller from checking the down controller before initiating giveback (bad idea).
  • Allow controllers to take each other over in case of hardware failure, and specify what IP/port to notify the partner on.
  • In a MetroCluster, change FSIDs on the partner's volumes and aggregates during a takeover.
  • Change how quickly an automatic takeover occurs if the partner is not responding.
  • Disable automatic takeovers.
  • Allow automatic takeovers when a discrepancy in disk shelf numbers is discovered.
  • Allow automatic takeovers when a NIC or all NICs fail.
  • Allow automatic takeovers on panic.
  • Allow automatic takeovers on partner reboot.
  • Allow automatic takeovers if the partner fails within 60s of being online.

The command used to initiate and control takeover/giveback is cf. Here are your main options:

  • cf disable: disables the clustering behavior.
  • cf enable: enables the clustering behavior.
  • cf takeover: takes down the partner and spoofs it.
  • cf giveback: allows the down controller to take back its functionality. ONTAP won't allow a takeover or giveback to be initiated if the system fails its checks for whether the action would be disruptive or cause data loss.
  • cf giveback -f: bypasses ONTAP's first level of checks as long as data corruption and filer error aren't a possibility.
  • cf takeover -f: allows the takeover even if it will abort a coredump on the partner. Adding -n to the command makes it ignore whether the partner has a compatible version of ONTAP.
  • cf forcegiveback: ignores whether it is safe to do a giveback. Can result in data loss.
  • cf forcetakeover: ignores whether it is safe to do a takeover.  Can result in data loss. -d bypasses all of ONTAP's checks and initiates the takeover regardless of data loss or disruption. -f also bypasses the prompt for confirmation.
  • cf status: informs you of the current state of the clustered relationship.
