Wednesday, January 22, 2014

Adventures with CDOT

Had an interesting situation after a boot-menu option 4 (wipe the configuration and initialize all disks) on a new CDOT system (3250s). This is a bit long, but I wanted to get it all onto paper and into our tribal knowledge. While one controller was still zeroing, I started configuring its partner: set HA true, set to switchless, and updated from 8.2P3 to 8.2P5. Then I updated the first controller.
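For reference, the HA/switchless part of that setup looked roughly like this (a sketch from memory; the switchless-cluster option lives at advanced privilege, and its exact syntax has shifted between releases, so check your version's docs):

cluster::> cluster ha modify -configured true
cluster::> set -privilege advanced
cluster::*> network options switchless-cluster modify true
cluster::*> set -privilege admin

When the first controller then tried to join the cluster, I saw this: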

Error: Node "cluster-04" on ring "Management" is offline. Check the health of the cluster using the "cluster show" command. For further assistance, contact support personnel.

Well, ok.  A little checking:
cluster::> cluster show
Node                  Health  Eligibility
--------------------- ------- ------------
cluster-01        true    true
cluster-04        true    true
Warning: Cluster HA has not been configured. Cluster HA must be configured on a
         two-node cluster to ensure data access availability in the event of
         storage failover. Use the "cluster ha modify -configured true" command
         to configure cluster HA.
2 entries were displayed.

cluster::> node show
Node      Health Eligibility Uptime        Model       Owner    Location 
--------- ------ ----------- ------------- ----------- -------- ---------------
cluster-01
          false  true         00:26:23.001 FAS3250              Minneapolis DR Site
cluster-04
          false  true         00:09:39.043 FAS3250
Warning: Cluster HA has not been configured. Cluster HA must be configured on a
         two-node cluster to ensure data access availability in the event of
         storage failover. Use the "cluster ha modify -configured true" command
         to configure cluster HA.
2 entries were displayed.
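So both nodes claim to be healthy in cluster show, but both show false in node show. In hindsight, since the original error called out the "Management" ring, checking the replicated-database rings directly would have been a good next step (advanced privilege):

cluster::> set -privilege advanced
cluster::*> cluster ring show
cluster::*> set -privilege admin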

Well, then let’s modify cluster HA.
cluster::> cluster ha modify -configured true

Warning: High Availability (HA) configuration for cluster services requires
         that both SFO storage failover and SFO auto-giveback be enabled. These
         actions will be performed if necessary.
Do you want to continue? {y|n}: y
Error: command failed: Not enough online nodes in the cluster:
       SL_REMOVE_EPSILON_OOQ_ERROR (code 129)
       There are too few healthy nodes in the cluster to allow join of
       additional nodes. Ensure that the nodes are operational and re-issue the
       command. Use the "cluster show" command on a node in the target cluster
       to view the state of the cluster.
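For what it's worth, epsilon is the extra tie-breaking vote in the cluster's replicated-database quorum. In a two-node cluster it's supposed to stay unassigned, with cluster HA playing tiebreaker instead, which is presumably why a single unhealthy node blocks the change. At advanced privilege, cluster show adds an Epsilon column if you want to see who holds it:

cluster::> set -privilege advanced
cluster::*> cluster show
cluster::*> set -privilege admin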

Well sheesh.  So I reboot and what happens?  A takeover.
Jan 22 14:53:28 [msp-cluster-04:callhome.sfo.takeover:CRITICAL]: Call home for CONTROLLER TAKEOVER COMPLETE AUTOMATIC
Jan 22 14:53:28 [msp-cluster-04:callhome.reboot.takeover:error]: Call home for PARTNER REBOOT (CONTROLLER TAKEOVER)

But the system doesn’t think it was taken over, or that it’s in HA mode.
cluster::> storage failover show-giveback
               Partner
Node           Aggregate         Giveback Status
-------------- ----------------- ---------------------------------------------
Warning: Unable to list entries on node cluster-01. RPC: Port mapper
         failure - RPC: Timed out

cluster::cluster ha> modify -configured true
Warning: High Availability (HA) configuration for cluster services requires
         that both SFO storage failover and SFO auto-giveback be enabled. These
         actions will be performed if necessary.
Do you want to continue? {y|n}: y
Error: command failed: Could not enable auto-sendhome on partner node: Failed
       to set option cf.giveback.auto.enable. Reason: 169.254.97.26 is not
       healthy.

After another reboot I turned HA off and back on, and everything cleared up; takeovers and givebacks (TO/GB) worked perfectly.
cluster::> cluster ha modify -configured true
Warning: High Availability (HA) configuration for cluster services requires
         that both SFO storage failover and SFO auto-giveback be enabled. These
         actions will be performed if necessary.
Do you want to continue? {y|n}: y
Notice: HA is configured in management.
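For completeness, the off/on toggle is just the same command with false and then true, and storage failover show is a quick sanity check that both nodes see each other afterward:

cluster::> cluster ha modify -configured false
cluster::> cluster ha modify -configured true
cluster::> storage failover show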

So on to the next problem: one of the vol0s isn't being recognized.

[screenshot of the error]
After a lot of searching, I found this magical solution:

[excerpt from the solution]

Curiously, vol0 is referred to in the excerpt as a "7-Mode volume." Both vol0s show up that way, though, and there's no way to change it; the word from other engineers is that this is correct.
cluster::> vol show -is-cluster-volume false
  (volume show)
Vserver    Volume  Aggregate             State   Type   Size  Available Used%
---------- ------- --------------------- ------- ---- ------ --------- -----
cluster-01 vol0    aggr0_msp_cluster_01  online  RW    330GB   310.8GB    5%
cluster-02 vol0    aggr0_msp_cluster_02  online  RW    330GB   310.8GB    5%
2 entries were displayed.

Lastly, I needed to move one of the vol0s. I used this KB article to move it over to a new aggregate:
https://kb.netapp.com/support/index?page=content&id=1013762&actp=search&viewlocale=en_US&searchid=1390428133178

Thanks for making it all the way to the end with me!

Thursday, January 9, 2014

Vol Options/Thin Provisioning

Once the volume guarantee is set to none, the volume is considered thin provisioned. If thin provisioning is being used, I recommend these settings for most data sets (rough command sketches follow each list below):
Block:
· Turn on vol autosize
· Turn off the snapshot reserve
· Turn off fractional reserve (depends on the workload)
· Turn on snap autodelete (based on the space taken up by snapshots or their age)
  o commitment=try
  o trigger=volume
  o target_free_space=20%
  o delete_order=oldest_first
  o defer_delete=user_created
  o try_first=volume_grow
· Turn off the LUN reservation
· Turn off the LUN space guarantee
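Here's the block recipe as a rough sketch in 7-Mode syntax, assuming a volume named vol1 with a LUN at /vol/vol1/lun1 (both names are placeholders); clustered ONTAP has equivalents under volume modify and volume snapshot autodelete modify. Note that try_first is technically a vol option rather than a snap autodelete setting:

vol options vol1 guarantee none              # thin provision the volume
vol autosize vol1 -m 1024g -i 50g on         # grow up to 1TB in 50GB steps
snap reserve vol1 0                          # no snapshot reserve
vol options vol1 fractional_reserve 0        # no fractional reserve
snap autodelete vol1 commitment try
snap autodelete vol1 trigger volume
snap autodelete vol1 target_free_space 20
snap autodelete vol1 delete_order oldest_first
snap autodelete vol1 defer_delete user_created
snap autodelete vol1 on
vol options vol1 try_first volume_grow       # grow the volume before deleting snaps
lun set reservation /vol/vol1/lun1 disable   # turn off the LUN space reservation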

File:
· Turn on vol autosize
· Set the snapshot reserve to 20%
· Turn on snap autodelete (based on the space taken up by snapshots or their age)
  o commitment=try
  o trigger=snap_reserve
  o target_free_space=20%
  o delete_order=oldest_first
  o defer_delete=user_created
  o try_first=volume_grow
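And the file-serving version, again with vol1 as a placeholder; the only real differences are the 20% snapshot reserve and the snap_reserve trigger:

vol options vol1 guarantee none
vol autosize vol1 -m 1024g -i 50g on
snap reserve vol1 20                         # keep a 20% snapshot reserve
snap autodelete vol1 commitment try
snap autodelete vol1 trigger snap_reserve    # fire when the reserve fills up
snap autodelete vol1 target_free_space 20
snap autodelete vol1 delete_order oldest_first
snap autodelete vol1 defer_delete user_created
snap autodelete vol1 on
vol options vol1 try_first volume_grow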

Wednesday, January 8, 2014

New Year's Technical Blast!

· General NetApp:
  o 2040s can only run up to ONTAP 8.1.4
  o 2240s can't run anything below 8.1
  o 8.2 licenses are worthless in 8.1, but you can generate temp licenses easily here: http://support.netapp.com/NOW/download/special/evaluation.cgi?
  o If your telnet connection dies right as you log in, there's a stale connection. Kill it with "logout telnet"
· Snapmirror:
  o When snapmirror initialize gives you a "network error," it might mean the source volume doesn't exist, is misspelled, or is too large.
  o "sysstat 1" is a quick, easy way to view the throughput of a system.
  o The replication.throttle.enable option, combined with replication.throttle.incoming.max_kbs and replication.throttle.outgoing.max_kbs, is a quick, easy way to turn down snapmirror traffic. The value is in kilobytes per second, so 50000 ≈ 50MB/s (see the sketch right after this list).
  o A single 1Gb link can transfer as much as 130MB/s if the disks can handle it.
  o Reverting a snapmirror destination is a very, very tough process. Avoid it.
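The throttle bit as a sketch (7-Mode options; values are in KB/s and apply to all replication traffic on the controller):

options replication.throttle.enable on
options replication.throttle.incoming.max_kbs 50000   # ~50MB/s inbound
options replication.throttle.outgoing.max_kbs 50000   # ~50MB/s outbound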
· ISCSI:
  o There's a tool that will generate a script to re-set LUN serial numbers: Ntstp.netapp.com, at the top menu bar under Tools.
  o LUN IDs, LUN serial numbers, and iSCSI iqn nodenames all matter. Luckily, all are easy to set.
  o If you're updating ONTAP to 8.2, Oracle Linux-based systems using ASMlib can crash entirely. There's an easy workaround; see page 61 in the 8.2 release notes.
· CIFS:
  o cifsconfig_setup.cfg contains the list of shares and permissions on the system. You can just copy it to a new system to recreate them.

· Bonus: you can separate snapmirror streams by editing your snapmirror.conf file. Format:
replication_stream1=multi(first IP address of source, first IP address of destination)
replication_stream2=multi(second IP address of source, second IP address of destination)

Then for each snapmirror relationship, replace the source system’s name with the replication stream’s name.  Example:
rep1=multi(172.16.0.7,172.16.0.5)
rep2=multi(172.16.0.8,172.16.0.6)

rep1:volname destnetapp:volname_data kbs=5000 0 1 * *
rep2:volname destnetapp:volname kbs=5000 0 1 * *

(The four trailing fields are the schedule: minute, hour, day of month, and day of week, so "0 1 * *" kicks off at 1:00 AM every day.)

· Double bonus: here's a list of other important config files on the system, all living in /etc on the root volume. Check them out!
krb (directory)
cifssec.cfg
cifsconfig_share.cfg
cifs_homedir.cfg
cifs_nbalias.cfg
cifsconfig_setup.cfg
exports
filersid.cfg
group
hosts
hosts.equiv
krb5.keytab
krb5auto.conf
lclgroups.cfg
nsswitch.conf
passwd
quotas
resolv.conf
usermap.cfg
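You can eyeball any of these from the console with rdfile (and edit them with wrfile, carefully; wrfile replaces a file's contents outright). For example, with filer> being your 7-Mode prompt:

filer> rdfile /etc/snapmirror.conf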