IT engineering and a little bit of hacking: January 2014

Wednesday, January 22, 2014

Adventures with CDOT

Had an interesting situation after an option 4 on a new CDOT system (3250’s), this is a bit long but I wanted to get it all onto paper and into our tribal knowledge. While one controller was still clearing, I started configuring the other one (set HA true, set to switchless, update from 82p3 to 82p5). Then I updated the other controller. When the other system tried to join the cluster, I saw this:

Error: Node "cluster-04" on ring "Management" is offline. Check the health of the cluster using the "cluster show" command. For further assistance, contact support personnel.

Well, ok. A little checking:

cluster::> cluster show

Node Health Eligibility

--------------------- ------- ------------

cluster-01 true true

cluster-04 true true

Warning: Cluster HA has not been configured. Cluster HA must be configured on a

two-node cluster to ensure data access availability in the event of

storage failover. Use the "cluster ha modify -configured true" command

to configure cluster HA.

2 entries were displayed.

cluster::> node show

Node Health Eligibility Uptime Model Owner Location

--------- ------ ----------- ------------- ----------- -------- ---------------

cluster-01

false true 00:26:23.001 FAS3250 Minneapolis DR Site

cluster-04

false true 00:09:39.043 FAS3250

Warning: Cluster HA has not been configured. Cluster HA must be configured on a

two-node cluster to ensure data access availability in the event of

storage failover. Use the "cluster ha modify -configured true" command

to configure cluster HA.

2 entries were displayed.

Well, then let’s modify cluster HA.

cluster::> cluster ha modify -configured true

Warning: High Availability (HA) configuration for cluster services requires

that both SFO storage failover and SFO auto-giveback be enabled. These

actions will be performed if necessary.

Do you want to continue? {y|n}: y

Error: command failed: Not enough online nodes in the cluster:

SL_REMOVE_EPSILON_OOQ_ERROR (code 129)

There are too few healthy nodes in the cluster to allow join of

additional nodes. Ensure that the nodes are operational and re-issue the

command. Use the "cluster show" command on a node in the target cluster

to view the state of the cluster.

Well sheesh. So I reboot and what happens? A takeover.

Jan 22 14:53:28 [msp-cluster-04:callhome.sfo.takeover:CRITICAL]: Call home for CONTROLLER TAKEOVER COMPLETE AUTOMATIC

Jan 22 14:53:28 [msp-cluster-04:callhome.reboot.takeover:error]: Call home for PARTNER REBOOT (CONTROLLER TAKEOVER)

But the system doesn’t think it was taken over, or that it’s in HA mode.

cluster::> storage failover show-giveback

Partner

Node Aggregate Giveback Status

-------------- ----------------- ---------------------------------------------

Warning: Unable to list entries on node cluster-01. RPC: Port mapper

failure - RPC: Timed out

cluster::cluster ha> modify -configured true

Warning: High Availability (HA) configuration for cluster services requires

that both SFO storage failover and SFO auto-giveback be enabled. These

actions will be performed if necessary.

Do you want to continue? {y|n}: y

Error: command failed: Could not enable auto-sendhome on partner node: Failed

to set option cf.giveback.auto.enable. Reason: 169.254.97.26 is not

healthy.

After another reboot I turned off and on HA, then everything cleared up and TO/GB’s were working perfectly.

cluster::> cluster ha modify -configured true

Warning: High Availability (HA) configuration for cluster services requires

that both SFO storage failover and SFO auto-giveback be enabled. These

actions will be performed if necessary.

Do you want to continue? {y|n}: y

Notice: HA is configured in management.

So on to the next problem: one of the vol0’s isn’t being recognized (click for larger image).

After a lot of searching, I found this magical solution.

Curiously, the vol0 is referred to in the excerpt above as a “7-Mode volume.” But both vol0’s are, and there’s no way to change it. The word from other engineers is that this is correct.

cluster::> vol show -is-cluster-volume -is-cluster-volume false

(volume show)

Vserver Volume Aggregate State Type Size Available Used%

--------- ------------ ------------ ---------- ---- ---------- ---------- -----

cluster-01 vol0 aggr0_msp_cluster_01 online RW 330GB 310.8GB 5%

cluster-02 vol0 aggr0_msp_cluster_02 online RW 330GB 310.8GB 5%

2 entries were displayed.

Lastly, I needed to move one of the vol0’s. I used this link to move the vol0 over to a new aggregate:

https://kb.netapp.com/support/index?page=content&id=1013762&actp=search&viewlocale=en_US&searchid=1390428133178

Thanks for making it all the way to the end with me!

Thursday, January 9, 2014

Vol Options/Thin Provisioning

Once volume guarantee is set to none the volume is considered thin provisioned. If thin provisioning is being used, I recommend these for most data sets:

Block:

· Turn on vol autosize

· Turn off snapshot reserve

· Turn off fractional reserve (depends on the workload)

· Turn on snap autodelete (based on the size taken up by snapshots or their age)

o commitment=try

o trigger=volume

o target_free_space=20%

o delete_order=oldest_first

o defer_delete=user_created

o try_first=volume_grow

Turn off LUN reserve

Turn off LUN Space guarantee

File:

· Turn on vol autosize

· Turn snapshot reserve=20%

· Turn on snap autodelete (based on the size taken up by snapshots or their age)

o commitment=try

o trigger=snap_reserve

o target_free_space=20%

o delete_order=oldest_first

o defer_delete=user_created

o try_first=volume_grow

Wednesday, January 8, 2014

New Years Technical Blast!

· General NetApp:

o 2040’s can only run up to ONTAP 8.1.4

o 2240’s can’t run below 8.1

o 8.2 licenses are worthless in 8.1, but you can generate temp licenses easily here: http://support.netapp.com/NOW/download/special/evaluation.cgi?

o If your telnet connection dies right as you log in, there’s a stale connection. Kill it with “logout telnet”

· Snapmirror:

o When snapmirror initialize gives you a “network error,” it might mean the source volume doesn’t exist/is misspelled/is too large.

o “sysstat 1” is a quick easy way to view the throughput of a system

o Options replication.throttle.enable combined with options replication.throttle.incoming/outgoing is a quick, easy way to turn down snapmirror traffic. The value is in kilobytes/s, so “50,000” = ~50MB

o A single 1Gb link can transfer as much as 130MB/s if the disks can handle it

o Reverting a snapmirror destination is a very, very tough process. Avoid it.

· ISCSI:

o There’s a tool that will generate a script to re-set LUN serial numbers. Ntstp.netapp.com, at the top menu bar under tools.

o LUN ID’s, LUN SN’s, and iqn nodenames matter. Luckily all are easy to set.

o If you’re updating ONTAP to 8.2, Oracle Linux-based systems using ASMlib can crash entirely. There’s an easy workaround, see page 61 in the 8.2 release notes.

· CIFS:

o CIFSCONFIG_SETUP.CFG contains the list of shares and permissions on the system. You can just copy that to a new system to emulate them.

· Bonus: you can separate snapmirror streams by editing your snapmirror.conf file. Format:

Replication_stream1 = multi(first ip address of source, first ip address of destination)

Replication_stream2 = multi(second ip address of source, second ip address of destination)

Then for each snapmirror relationship, replace the source system’s name with the replication stream’s name. Example:

rep1 = multi(172.16.0.7,172.16.0.5)

rep2 = multi(172.16.0.8,172.16.0.6)

rep1:volname destnetapp:volname_data kbs=5000 0 1

rep2:volname destnetapp:volname kbs=5000 0 1

· Double bonus: Here’s a list of other important config files in the system. Check them out!

KRB (DIRECTORY)
CIFSSEC.CFG
CIFSCONFIG_SHARE.CFG
CIFS_HOMEDIR.CFG
CIFS_NBALIAS.CFG
CIFSCONFIG_SETUP.CFG
EXPORTS
FILERSID.CFG
GROUP
HOSTS
HOSTS.EQUIV
KRB5.KEYTAB
KRB5AUTO.CONF
LCLGROUPS.CFG
NSSWITCH.CONF
PASSWD
QUOTAS
RESOLV.CONF
USERMAP.CFG

Pages

Wednesday, January 22, 2014

Adventures with CDOT

Thursday, January 9, 2014

Vol Options/Thin Provisioning

Wednesday, January 8, 2014

New Years Technical Blast!