Wednesday, January 22, 2014

Adventures with CDOT

Had an interesting situation after an option 4 (the boot-menu option that wipes the configuration and initializes the disks) on a new CDOT system (FAS3250s).  This is a bit long, but I wanted to get it all onto paper and into our tribal knowledge.  While one controller was still clearing, I started configuring the other one: set HA to true, set the cluster to switchless, and updated from 8.2P3 to 8.2P5.  Then I updated the first controller (the one that had been clearing).  When it tried to join the cluster, I saw this:

Error: Node "cluster-04" on ring "Management" is offline. Check the health of the cluster using the "cluster show" command. For further assistance, contact support personnel.

Well, ok.  A little checking:
cluster::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
cluster-01            true    true
cluster-04            true    true
Warning: Cluster HA has not been configured. Cluster HA must be configured on a
         two-node cluster to ensure data access availability in the event of
         storage failover. Use the "cluster ha modify -configured true" command
         to configure cluster HA.
2 entries were displayed.

cluster::> node show
Node      Health Eligibility Uptime        Model       Owner    Location 
--------- ------ ----------- ------------- ----------- -------- ---------------
cluster-01
          false  true         00:26:23.001 FAS3250              Minneapolis DR Site
cluster-04
          false  true         00:09:39.043 FAS3250
Warning: Cluster HA has not been configured. Cluster HA must be configured on a
         two-node cluster to ensure data access availability in the event of
         storage failover. Use the "cluster ha modify -configured true" command
         to configure cluster HA.
2 entries were displayed.
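
Since the join error specifically called out the “Management” ring, the state of the replication rings themselves is worth a look.  This isn’t from the original session, but it’s the check I’d add in hindsight (advanced privilege):

cluster::> set -privilege advanced
cluster::*> cluster ring show

Every unit (mgmt, vldb, vifmgr, bcomd) should show a master and report online on both nodes.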

Well, then let’s modify cluster HA.
cluster::> cluster ha modify -configured true

Warning: High Availability (HA) configuration for cluster services requires
         that both SFO storage failover and SFO auto-giveback be enabled. These
         actions will be performed if necessary.
Do you want to continue? {y|n}: y
Error: command failed: Not enough online nodes in the cluster:
       SL_REMOVE_EPSILON_OOQ_ERROR (code 129)
       There are too few healthy nodes in the cluster to allow join of
       additional nodes. Ensure that the nodes are operational and re-issue the
       command. Use the "cluster show" command on a node in the target cluster
       to view the state of the cluster.
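
That SL_REMOVE_EPSILON_OOQ error points at epsilon, the tiebreaker vote one node can hold for quorum.  In a two-node cluster epsilon is supposed to be off entirely (cluster HA takes over that job), so it’s worth checking whether a node is still holding it.  I didn’t capture this at the time; the commands below are standard advanced-privilege syntax, and the node name is just this cluster’s:

cluster::> set -privilege advanced
cluster::*> cluster show -fields epsilon
cluster::*> cluster modify -node cluster-01 -epsilon false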

Well sheesh.  So I rebooted, and what happened?  A takeover.
Jan 22 14:53:28 [msp-cluster-04:callhome.sfo.takeover:CRITICAL]: Call home for CONTROLLER TAKEOVER COMPLETE AUTOMATIC
Jan 22 14:53:28 [msp-cluster-04:callhome.reboot.takeover:error]: Call home for PARTNER REBOOT (CONTROLLER TAKEOVER)
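
Normally the next step after an unexpected takeover would just be a giveback (standard command, not captured from the original session):

cluster::> storage failover giveback -ofnode cluster-01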

But the system doesn’t think it was taken over, or that it’s in HA mode.
cluster::> storage failover show-giveback
               Partner
Node           Aggregate         Giveback Status
-------------- ----------------- ---------------------------------------------
Warning: Unable to list entries on node cluster-01. RPC: Port mapper
         failure - RPC: Timed out

cluster::cluster ha> modify -configured true
Warning: High Availability (HA) configuration for cluster services requires
         that both SFO storage failover and SFO auto-giveback be enabled. These
         actions will be performed if necessary.
Do you want to continue? {y|n}: y
Error: command failed: Could not enable auto-sendhome on partner node: Failed
       to set option cf.giveback.auto.enable. Reason: 169.254.97.26 is not
       healthy.
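
The RPC timeout above and the unhealthy 169.254.x.x address here both point at the cluster network; those link-local addresses belong to the cluster LIFs.  A sanity check I’d add at this point, though it isn’t from the original session (ping-cluster is advanced privilege):

cluster::> network interface show -role cluster
cluster::> set -privilege advanced
cluster::*> cluster ping-cluster -node cluster-01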

After another reboot I turned HA off and back on, and then everything cleared up; takeovers and givebacks (TO/GBs) were working perfectly.
cluster::> cluster ha modify -configured true
Warning: High Availability (HA) configuration for cluster services requires
         that both SFO storage failover and SFO auto-giveback be enabled. These
         actions will be performed if necessary.
Do you want to continue? {y|n}: y
Notice: HA is configured in management.
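
To confirm the pair was actually healthy afterward, the standard check (not captured in the original session) is:

cluster::> storage failover show

Both nodes should report that takeover is possible, with no waiting-for-giveback state.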

So on to the next problem: one of the vol0s wasn’t being recognized.

[Screenshot: one node’s vol0 not being recognized]

After a lot of searching, I found a magical solution.

Curiously, the excerpt containing that solution refers to the vol0 as a “7-Mode volume.”  But both vol0s are flagged that way, and there’s no way to change it.  The word from other engineers is that this is correct; node root volumes are node-scoped, which is why they show up with -is-cluster-volume false:
cluster::> vol show -is-cluster-volume false
  (volume show)
Vserver    Volume Aggregate            State  Type    Size Available Used%
---------- ------ -------------------- ------ ---- ------- --------- -----
cluster-01 vol0   aggr0_msp_cluster_01 online RW     330GB   310.8GB    5%
cluster-02 vol0   aggr0_msp_cluster_02 online RW     330GB   310.8GB    5%
2 entries were displayed.

Lastly, I needed to move one of the vol0s.  I used this KB article to move it over to a new aggregate:
https://kb.netapp.com/support/index?page=content&id=1013762&actp=search&viewlocale=en_US&searchid=1390428133178
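
A side note for anyone reading this on a newer release: later versions of ONTAP collapse that whole KB procedure into a single command.  A hypothetical example with a placeholder disk list; check the documentation for your version before trying it:

cluster::> system node migrate-root -node cluster-01 -disklist 1.0.0,1.0.1,1.0.2 -raid-type raid_dp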

Thanks for making it all the way to the end with me!
