IT engineering and a little bit of hacking: October 2011

Monday, October 31, 2011

Brazilian Jiu Jitsu

I've decided to begin including one of my other passions into this blog: Brazilian Jiu Jitsu. I also train boxing on the side, but my BJJ is where my heart is.

Recognizing how impactful a tool writing can be to learning and progression, I'm going to begin adding my insights to the sport's evolution. I probably won't talk much about technique and moves (youtube videos will always be better than print for that) but more about the culture, mindset, progression, and strategies.

My second BJJ tournament was a couple weekends ago: I went 7-3, placed 2nd in my weight class twice, and took 3rd for absolute no-gi.

Notes from that tournament: eat more in the morning. Bring more Gatorade. Bring a coach! Understand the rules better. Work on guillotines, work on side control escapes. It's better to have energy against bigger guys than be tired against smaller guys. Take body fat from 11% to 7%. Cardio is everything! Work on armbars from guard. Work on stand up (esp darce and guillotine).

Next tournament is Dec 10th, and I need to drop to 200 lbs. I'm walking around at about 208 right now...let's get to work!

Finding Misaligned VMDK's

Received a useful email from a colleague today:

"Starting with ONTAP 7.3.5, ONTAP has a nice feature to help identify misaligned VMDKs...this feature will be helpful as we roll out 7.3.6P1. At the end of the "nfsstat -d" command, you will see a section named "Files Causing Misaligned IO's". This will have a list of files that are doing misaligned I/O, along with a counter that indicates the frequency at which this IO is happening. If you want to start the counters over again, you can use "nfsstat -z" to zero the counters.

Below is a snippet of this output from a filer (the VMDKs with high counter values), which has been having some performance problems lately. We have 18 VMs here doing a significant amount of misaligned IO since the upgrade was done on Saturday night (there are 48 VMs in total doing misaligned IO). We need to get these VMDKs aligned in order to help improve the write performance on this system."

Files Causing Misaligned IO's [Counter=48113], Filename=infra_pv_vms_v03_snap14/ds1/c111asz/c111asz_1-flat.vmdk [Counter=18865]

Thursday, October 27, 2011

NetApp Experience: Shelf Add => Disk Bypass

During a shelf add last night, we ran into another hairy situation. Turns out we didn't connect the new shelf smoothly enough when sliding the SFP into the port, which caused us to see this:

[Filer: esh.bypass.err.disk:error]: Disk 7b.50 on channels 7b/0d disk shelf ID 3 ESH A bay 2 Bypassed due to excessive port oscillations

[Filer: ses.drive.missingFromLoopMap:CRITICAL]: On adapter 0d, the following device(s) have not taken expected addresses: 0d.57 (shelf 3 bay 9), 0d.58 (shelf 3 bay 10), 0d.59 (shelf 3 bay 11), 0d.61 (shelf 3 bay 13), 0d.67 (shelf 4 bay 3), 0d.70 (shelf 4 bay 6), 0d.74 (shelf 4 bay 10), 0d.75 (shelf 4 bay 11)

[Filer: shm.bypassed.disk.fail.disabled:error]: shm: Disk bypass check has been disabled due to multiple bypassed disks on host bus adapter 0d, shelf 3.

[Filer: shm.bypassed.disk.fail.disabled:error]: shm: Disk bypass check has been disabled due to multiple bypassed disks on host bus adapter 0d, shelf 4.

[Filer: ses.exceptionShelfLog:info]: Retrieving Exception SES Shelf Log information on channel 0d ESH module A disk shelf ID 3.

In fcadmin device_map, this looked like this:

Loop 0d

Loop 7b

Note that each loop saw a different number of bypassed disks. Sysconfig -r, disk show -n, and vol status -f all came back normal. A little backstory here: ONTAP bypasses disks because in certain scenarios, a single disk can lock up an entire FC loop (read here for more info on this). This is not the same thing as failing a disk: there are various situations where the disk will just be ignored by the system.

This thread indicated to us that the fix would likely be slowly reseating the disks. You have to respect the filer: pulling and pushing a ton of disks consecutively may cause unexpected consequences, so wait at least 90s in between each action. Pull, 90s, reseat, 90s, pull another, etc.

An example of unexpected consequences is below: one disk reacted poorly to being re-seated, and failed. When the system re-scanned the disk after it was pushed in, we saw this:
[Filer: disk.init.failureBytes:error]: Disk 0d.70 failed due to failure byte setting
We attempted to reseat the disk again, to the same effect. The disk didn't show up in vol status -f. We also tried to unfail the disk, to no effect. Here's how we fixed it, with 90s in between each step:
1) pull failed disk
2) pull another disk
3) swap failed disk into other slot
4) swap other disk into failed disk's slot
5) disk show
6) priv set advanced
7) disk unfail 7b.71 (this is the slot the failed disk was in).

Thursday, October 20, 2011

NetApp Experience: ONTAP 8.0.2 Upgrade

Love this. Had a customer who upgraded ONTAP from 7.3 to 8.0.2 before my scheduled work, and things got interesting. Their SQL server couldn't see its LUNs, which is were the databases obviously reside. Unfortunately for them, they had placed their configuration files on a LUN as well, and part of that configuration was "what data lives on which LUN?" They just pointed SQL down the correct path to find the configuration file, and SQL did the rest.

Didn't take too long to figure out and didn't cause any production outage, so it was pretty enjoyable to watch. Talk about a *doh*!

Tuesday, October 18, 2011

Business Travel

Things I've learned so far:

Never pass up an opportunity to charge your laptop/phone.
Bring noise-canceling earbuds.
Always travel with at least $40 in cash.
Bring a second pair of pants.
"He who travels happily, must travel light"
Don't be a lemming: why rush to be first on a plane?
Never check a bag if you can avoid it.

How about you, any travel tips?

Wednesday, October 12, 2011

NetApp Experience: CIFS Error

Here's a new case: we have a filer that we're working on getting data off of to retire it, and ran into an ONTAP bug: this filer is unable to execute CIFS commands that will allow us to rename or offline volumes. This is a big problem for use because renaming and offline-ing volumes are part of our retirement process.

Part of the solution for this is called an "NMI reboot." NMI stands for non-maskable interrupt, which basically means the software in the computer is incapable of ignoring this reboot. You may be familiar with the small pinhole button on a lot of consumer hardware that would "hard reset" the system: that's it.

This system is a clustered FAS980 running ONTAP 7.2.7. The plan is to use that pinhole button to reset the filers one at a time: when a reset occurs, the system should failover, and no noticeable downtime should result. After the reset, we'll do a giveback, let everything settle, and repeat the process on the other system.

Friday, October 7, 2011

NetApp Experience: Bad Slot

Really very interesting things have happened lately. I had a shelf add that kicked out a ridiculous amount of errors for one disk on the new shelf:

disk.senseError:error]: Disk 2d.53: op 0x28:0000a3e8:0018 sector 0 SCSI:hardware error - (4 44 0 3)

diskown.RescanMessageFailed:warning]: Could not send rescan message to eg-naslowpc-h01. Please type disk show on the console for it to scan the newly inserted disks.

diskown.errorReadingOwnership:warning]: error 46 (disk condition triggered maintenance testing) while reading ownership on disk 2d.53

Disk 2d.53: op 0x28:0000a3f0:0008 sector 0 SCSI:hardware error - (4 44 0 3)
diskown.AutoAssignProblem:warning]: Auto-assign failed for disk 2d.53

The weird thing was that the messages just continued to loop rather than just fail the disk. We swapped a new disk into that slot, and the old disk into a different slot to see if the disk was bad: turns out, the slot is bad.

We also tried reseating shelf Module B on that shelf. NetApp Support informed me that "Module A handles communication to the even numbered disks by default, and Module B the odd disks." I don't think this is true.

We're working with the customer to find a good resolution for this. Since downtime is difficult to accomplish, we may try to swap out the shelf chassis while the system is running. We'll see :-)