IT engineering and a little bit of hacking: September 2011

Thursday, September 29, 2011

NetApp Insights: Shelf Shutdown

Now that we know we can perform a shelf reboot live, we got a bit adventurous.

The question we were trying to answer is "Could we replace/remove a shelf on a live system without causing downtime?" I used a 3160 cluster in the lab with 4 DS14s in a loop, slowly failed all the disks in shelf 3, and removed ownership on those disks. At that point, I could shut down/unplug that shelf at will, and neither system complained except noting that they were transitioning to single-path.

I doubt NGS will ever give the plan their full blessing, but it's good to know that it's ok from a technical standpoint.

Update 1: I also successfully swapped out a shelf chassis in this manner in the lab. The controllers were totally ok with a new serial number! No issues that I could find.

Update 2: NGS did in fact OK this action plan twice, but later completely backed out. There's concern that the system will keep the shelf registered in the OS somewhere. A possible solution for this is to perform a failover/giveback for each node after the shelf removal, since failover/giveback includes a reboot.

Wednesday, September 28, 2011

An open letter to all University Presidents

An open letter to all University Presidents:

A budget breakdown my alma mater mailed to me showed that 62% of our budget is staff and faculty salary and benefits. While I applaud the transparency, that makes it pretty hard to look favorably on a donations request: tuition has increased there 29.36% since 2005, in the midst of the greatest economic crisis since the Great Depression.

You University Presidents no doubt have many reasons for this: the marketing perspective on price being perceived as value in competition with other universities, competition for good faculty, and I know that few if any students pay the full amount of tuition. But in the midst of staggeringly high unemployment, American philanthropists have wiser and more deserving places to put their means when your university asks for support.

During my 4-year internship, my CEO asked our entire company, himself included, to take a pay freeze. And we did it willingly because we understood the investment we were making in an institution we believed in.

Has your university asked for similar sacrifices from its employees? Or will the future bear the brunt of this generation's economic mistakes? More debt on the back of our youth is not the answer to your university's future.

Thank you

Thursday, September 22, 2011

NetApp Experience: Shelf Add => Disk Fail

One of my practices when performing a shelf add is to wait between steps, specifically between unplugging and reconnecting any cables. My reasoning has been that the system should be given time to settle into its new circumstances, and in particular that the controller needs time to recognize which paths it can now use to communicate with the disks.

Digging deeper, one thing I recently learned is that each disk has two communication ports (referred to as A and B) to the shelf modules, and the paths are negotiated from the disk to the shelf module to the fiber ports on the controller. For example, disk 23 port A could be connecting through shelf module B to port 2c, and disk 23 port B could be connecting through shelf module A to port 1b.

All of that is important to understanding the serious issue that failed two disks in a production cluster recently. A single shelf (DS14mk2 750GB SATA) was connected MPHA to a clustered pair with this configuration:

disk 1d.18: disk port A to shelf module B to port 1d
disk 2b.22: disk port B to shelf module A to port 2b

After unplugging the cable from 1d to shelf module B, there was a 17-second delay and then this:

Cluster Notification mail sent: Cluster Notification from CONTROLLER (DISK CONFIGURATION ERROR) WARNING
Controller> scsi.path.excessiveErrors:error]: Excessive errors encountered by adapter 2b on disk device 2b.18.
Controller> scsi.cmd.transportError:error]: Disk device 2b.22: Transport error during execution of command: HA status 0x9: cdb 0x28:354a07b8:0048.
Controller> raid.config.filesystem.disk.not.responding:error]: File system Disk /aggr2/plex0/rg0/2b.22 Shelf 1 Bay 6 [NETAPP X268] is not responding.
Controller> scsi.cmd.transportError:error]: Disk device 2b.18: Transport error during execution of command: HA status 0x9: cdb 0x28:4f5e9748:0048.
Controller> disk.failmsg:error]: Disk 2b.22: command timed out.
Controller> raid.rg.recons.missing:notice]: RAID group /aggr2/plex0/rg0 is missing 1 disk(s).
Controller> raid.rg.recons.info:notice]: Spare disk 2b.27 will be used to reconstruct one missing disk in RAID group /aggr2/plex0/rg0.

Controller> diskown.errorReadingOwnership:warning]: error 23 (adapter error prevents command from being sent to device) while reading ownership on disk 2b.18

Analysis:
These two disks failed as a result of an HBA issue last night. When a path is disconnected, any disks that are owned over that path are engineered to use the redundant path. When we disconnected port 1d, the HBA in slot 2 produced errors that halted this normal renegotiation for two of the disks. Because the disks were not able to operate on the redundant path, the system considered the disks to be in a failed state and began reconstructing that data to spare disks. When this happened, we halted work to investigate and remediate. We'll probably just RMA the disks to reduce variables.

NetApp Support recommendation: Re-seat this HBA, which would require a failover/giveback to perform. Another option would be to replace the HBA (which is what we'll probably do).

Update: NGS (NetApp Support) has changed their minds and now think this is a disk firmware issue. The disk firmware was a couple of years behind, and their explanation is that SCSI errors caused by the firmware pile up over time, eventually cross a threshold, and cause an HBA to consider that disk incommunicable. There's no warning on the system that this HBA can't talk to that disk, and all the traffic is routed through the redundant path.

In this case, we had two disks in this situation, and when I unplugged the return path (the path they were active on), they tried to fail over to the other path and could not. NGS believes this was pure chance, a struck-by-lightning situation.
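To picture why pulling one cable can fail a disk that looked healthy, here is a minimal sketch of the path model described above. It is a toy model and not anything ONTAP exposes: the `Path` and `Disk` classes, the `usable` flag, and the disk names are all hypothetical, and the only behavior assumed is the one NGS described (an HBA silently giving up on a disk while traffic keeps flowing over the other path).

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    """One route from a disk port through a shelf module to a controller adapter."""
    disk_port: str       # "A" or "B"
    shelf_module: str    # "A" or "B"
    adapter: str         # controller port, e.g. "1d" or "2b"
    usable: bool = True  # False if the HBA has silently given up on this disk


@dataclass
class Disk:
    name: str
    paths: list = field(default_factory=list)

    def surviving_paths(self, unplugged_adapter):
        """Paths still usable after the cable on unplugged_adapter is pulled."""
        return [p for p in self.paths
                if p.usable and p.adapter != unplugged_adapter]


def disks_that_would_fail(disks, unplugged_adapter):
    """Names of disks left with no usable path once the adapter is unplugged."""
    return [d.name for d in disks if not d.surviving_paths(unplugged_adapter)]


# Hypothetical shelf: disk 18 has lost its 2b path without any warning,
# while disk 20 still has both paths.
shelf = [
    Disk("18", [Path("A", "B", "1d"), Path("B", "A", "2b", usable=False)]),
    Disk("20", [Path("A", "B", "1d"), Path("B", "A", "2b")]),
]

print(disks_that_would_fail(shelf, "1d"))  # prints ['18']
```

In this model, unplugging 2b instead would fail nothing, which is why the problem stays invisible until the cable on the one working path is pulled.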

I'll post the bug report on this soon: the gist of it is that between 40C and 50C, a latching mechanism can get "stuck" and error out, but will quickly recover. I'm skeptical of this because the highest temperature observed in this shelf was 36C.

2nd Update: As best as I can tell, the disk firmware update did the trick. We went through with shelf adds last night without seeing the same behavior. We did, however, see what we believe to be a separate issue.

Monday, September 19, 2011

NetApp Training Brain Dump: Experimenting

Quick notes on things I tested today:

If you run a disk fail command, you will have to wait a few hours for the data on that disk to be copied to a spare.

There is a -i option for the disk fail command that will immediately fail the disk, without copying the data.

If you have no spares and you have a disk that is not assigned and is not in use, you have to assign that disk to the controller before it will be used as a spare. If you have options disk.auto_assign on, it will have already been assigned to a controller. In either case, you won't need to add the disk to an aggregate: the system detects it as a spare and grabs it in the place of the failed disk.
To see how many failed disks you have, use vol status -f
To see how many spares you have, use vol status -s
If you want to see the status of your disks, disk show won't do it. You'll need to use disk show -v to see failed disks, and neither will show spare disks as being spare.
You can't resize an aggregate's RAID groups. You can however use aggr options raidsize to set the size for new RAID groups that are created for this aggregate.

Thursday, September 15, 2011

NetApp Insights: Usable Capacity

I saw some documentation given to a customer that estimated that for 144 2TB SATA disks (294.9TB), the customer could expect 170TB usable. It also said for 288 450GB SAS disks (126.5TB) they should expect 91TB usable. That's a big loss from a client's perspective.

I've previously developed a calculator to make it easy to plan your RAID Groups and aggregates, but now I want to use it to take a closer look at where all that space actually goes. A NetApp PSC expert told me the general rule is that for FC disks, 70% of raw is usable, and for SAS/SATA you take off another 10-12%. But let's see if we can dig into that.

Computers measure base 2, but drive manufacturers measure base 10. This means if your drive is labeled 1GB, it's actually 1000MB, not 1024MB.
The fuzziest part: drive manufacturers reserve between 7% and 15% on each disk. Some of this is for parity, and a lot of it is to account for failed sectors. I've observed 2TB SATA drives reporting 1.69TB or less for a loss of 13.5%, so I'll use that for these calculations.
You lose some space due to WAFL/disk asymmetry. The basic idea is that a 4KB block doesn't fit neatly into the disk sectors, so there's some waste. Some of this is taken into account by the manufacturer's reserve, so I can't quantify this in our calculations.
You lose some space to right-sizing. Since each drive manufacturer's 2TB disk is a slightly different size, ONTAP right-sizes all disks to the lowest common denominator to avoid incompatibly sized disks later. I can't find any data on how much space you lose to this process.
For every RAID Group, you lose 2 disks to parity/double parity.
You need to account for spares obviously.
WAFL requires 10% of the usable space to run the file system.

So for our two scenarios above, here's what we find:
Scenario 1
288 450GB SAS Drives
Spare drives: 8
Parity drives: 28

Credit: me!

Scenario 2
144 2TB SATA Drives
Spare drives: 6
Parity drives: 20

Credit: me!
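The arithmetic behind these scenarios can be sketched roughly as follows. This is a simplified estimate and not the calculator mentioned above: the function name and parameters are made up for illustration, right-sizing and WAFL/disk asymmetry are left out because they could not be quantified, and the 10 RAID groups are inferred from the 20 parity drives at 2 per group.

```python
def usable_capacity_tb(total_drives, spare_drives, raid_groups,
                       reported_tb_per_drive, wafl_reserve=0.10):
    """Estimate usable capacity in TB from the factors listed in this post.

    reported_tb_per_drive is the size a drive reports after the
    manufacturer's reserve. Right-sizing and WAFL/disk asymmetry are
    not modeled.
    """
    parity_drives = 2 * raid_groups   # two parity disks per RAID group
    data_drives = total_drives - spare_drives - parity_drives
    if data_drives <= 0:
        raise ValueError("no drives left for data")
    return data_drives * reported_tb_per_drive * (1 - wafl_reserve)


# Scenario 2: 144 drives, 6 spares, 20 parity drives (10 RAID groups),
# with each 2TB drive reporting about 1.69TB.
estimate = usable_capacity_tb(144, 6, 10, 1.69)
print(f"{estimate:.1f} TB")  # prints 179.5 TB
```

Changing `reported_tb_per_drive` is the quickest way to see how sensitive the result is to the manufacturer's reserve, which is the fuzziest input.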

Analysis:

You lose a consistent 15% because of the drive manufacturer whether you use EMC or NetApp or any other vendor.
To accomplish NetApp's goal of data protection (spares, parity, WAFL striping), you lose another 18-23%.
When you factor in backups and snapshots, you'll lose even more space.
One bright side is that using NetApp's dedup and efficient snapshot technologies, you can end up regaining this lost space.

Notice I'm still a considerable way away from the estimates given to the customer: 7.8% low for the 450GB system and 5% high for the 2TB system. There are still some gaps in my numbers here, so I would definitely appreciate any tips!

Tuesday, September 13, 2011

FlashCache Pros and Cons

Ran across a brilliant article over at The Missing Shade of Blue that brought up something I'd never considered: latency is the real way to measure speed. Throughput is a vital statistic to be sure, but if your throughput comes at the cost of latency you really have to consider that trade off.

Bit of background for noobs: obviously, fast data storage is more expensive than slow data storage. Data tiering is pretty much the same idea as storing things in your closet: the stuff you use more often is in the front (fast, expensive storage like Flash memory), the stuff you never take out can be hidden deep in the back (slow, cheap storage like SATA).

Some companies, like Compellent, run algorithms to see which data is read or written infrequently and move that data to SATA, while the frequently used data is moved to your faster SAS drives. NetApp (and EMC after they saw the success NetApp achieved) short-circuit this a bit by just adding a single, super-fast tier of cache.

NetApp FlashCache is 512GB of super-fast flash memory. EMC FAST Cache is actually made up of SSDs. Frequently accessed data is kept here so that the system doesn't have to go all the way to the drives, which increases the number of I/O operations per second (IOPS) you are able to write or read.
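For anyone who wants to see the caching idea in miniature, here is a small sketch of a read cache sitting in front of slower storage. It is only an illustration: the `ReadCache` class is hypothetical, a Python dict stands in for the drives, and least-recently-used eviction is an assumption chosen for simplicity, not a description of how FlashCache or FAST Cache actually decide what to keep.

```python
from collections import OrderedDict


class ReadCache:
    """A small LRU read cache in front of a slower backing store."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store   # dict standing in for the disks
        self.capacity = capacity             # max number of cached blocks
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)    # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.backing_store[block_id]      # the slow trip to the drives
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict least recently used
        return data


disks = {n: f"block-{n}" for n in range(100)}
cache = ReadCache(disks, capacity=4)
for block_id in [1, 2, 1, 1, 3, 2, 50, 1]:
    cache.read(block_id)
print(cache.hits, cache.misses)  # prints 4 4
```

Every hit is a read that never touches the drives, which is where the extra IOPS come from.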

The point that really struck me is that some Cache products, which create a super high tier for your data, can kill you on write latency. It turns out that EMC FAST Cache either increases write latency because of how it's implemented or opens the spigot for max IOPS so wide that the rest of the system can't keep up, exposing other bottlenecks. I'm sure that at some point if you throttle down the IOPS you'll see the write latency settle down. You'll still get a marked increase in IOPS, without the write latency.

This doesn't by any means settle that Cache products have no place in your SAN (it's still a fantastic performance boost for the money), but it does mean you have to factor in this effect when making the decision.

Monday, September 12, 2011

Virtual Technology Industry Analysis

Here's a pretty awesome analysis of virtual tech customers compiled by Forrester. Here are the highlights:

Figure 1: VMware is dominating. 93% of virtual tech customers have VMware.
Figure 2: In the rough economy, SAN customers are focused on:

Space utilization (efficiency): 53%
Cost: 39%
Performance: 30%

Figure 3: SAN vendor.

44% EMC
38% NetApp
24% HP/Lefthand
22% IBM.

67% of customers have only one storage vendor in their datacenter. I think that this is because only the bigger players can afford to create price competition in their environment, or perhaps only they really benefit enough in pricing to make it worth their while maintaining two or more products.
Figure 5: Protocol:

76% FC
37% NFS (up from 18% two years ago)
23% iSCSI

Notice on the second bullet (customer focus), both 1 and 2 are about cost. Customers understand that a higher dollar amount can save money in the long run using dedup, great snapshot management, and overall space efficiency. This also reflects the tough economy, and predicts skinnier margins for the storage industry.

http://media.netapp.com/documents/ar-storage-choices-for-virtual-server-environments.pdf

Friday, September 9, 2011

NetApp Insights: NDR Shelf Reboot

Got to witness a NetApp expert at work yesterday as he did some tests on a pretty cool capability that I hadn't heard of before. In ONTAP 7.3.2, DS14 shelves in certain hardware configurations (read the KB below) allow you to suspend I/O to the shelf for a certain period of time so it can be rebooted without the system panicking.

The basic idea is this: normally, if a shelf disappears off the loop, the system would catch the error and panic. In this case, the system goes into a mode where it tolerates this for a certain period of time by queuing or suspending traffic to the affected disks. In practice, you will see affected volumes suspend traffic for a short period of time. After the shelf reboot is complete, entering the power_cycle shelf completed command takes the system out of that mode and returns it to normal error catching.

For certain configurations the shelf will actually reboot automatically, and for older hardware/software combos the system will give you 60 minutes to manually shut off the power to the shelf. The specs say to expect up to a 60s suspension in traffic: in our tests, the automatic reboot took 11s and the manual one took up to 45s.

Here's an example of the command that reboots shelf 3 on loop 6a: storage power_cycle shelf start -f -c 6a -s 3
And here's the syntax:

power_cycle shelf -h
power_cycle shelf start [-f] -c <channel> [-s <shelf id>]
power_cycle shelf completed

Attempts to power-cycle a selected shelf in a channel or all the shelves in a channel.

'power_cycle shelf completed' command must be used, as and when instructed by the 'power_cycle shelf start' command.
-f do not ask for shelf power-cycle confirmation
-c if option -s is not specified power-cycle all shelves connected to the specified channel. if option -s is specified, power-cycle shelf id on specified channel.
-h display this help and exit.
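If you script this, a tiny helper that assembles the command line can keep you from leaving off -s by accident and cycling every shelf on the channel. This is a minimal sketch that only builds the string: the function name is made up, it uses only the flags shown in the help text above, and actually sending the command to a controller is left out.

```python
def power_cycle_command(channel, shelf_id=None, force=False):
    """Build a 'storage power_cycle shelf start' command line.

    With no shelf_id, the help text says every shelf on the channel is
    power-cycled, so callers should pass one unless that is intended.
    """
    if not channel:
        raise ValueError("a channel (for example '6a') is required")
    parts = ["storage", "power_cycle", "shelf", "start"]
    if force:
        parts.append("-f")             # skip the confirmation prompt
    parts += ["-c", channel]
    if shelf_id is not None:
        parts += ["-s", str(shelf_id)]
    return " ".join(parts)


print(power_cycle_command("6a", shelf_id=3, force=True))
# prints storage power_cycle shelf start -f -c 6a -s 3
```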

One idea we tested in the lab was using this to change the shelf ID while the system is still online. The shelf we rebooted had the mailbox disks on it, which caused a panic and a failover. This may still be possible in some conditions; I'll update as we figure this out.

Some ideas I want to test out:

What happens if a shelf other than the one you specified in the command goes offline? Is the system targeted in its tolerance of shelf loss, or does its tolerance extend across all shelves?
For setting the shelf ID:

What difference does it make if none of the disks are owned/are spares?
What if none of the disks in the shelf are mailbox disks?

https://sa.netapp.com/support/,DanaInfo=kb.netapp.com,SSL+index?page=content&id=3012745

Note: This capability is not available for DS4243s.

Wednesday, September 7, 2011

NetApp Insights: MetroCluster

My main criticism of NetApp's MetroCluster implementation is the same as this guy's: it has single points of failure.

Let's rewind a bit. NetApp has a product called a Fabric MetroCluster, in which you pretty much pick up one controller out of your HA pair and move it to another datacenter (I'm simplifying things). It's a good implementation in that it spreads the reliability of a single system out across two datacenters and replicates in real time. It's a bad implementation in that it's still a single system.

Everything can fail, so in SAN, the name of the game is redundancy. This is why customers buy TWO controllers when they purchase an HA system, even though both controllers are going into the same datacenter: each controller has ~6 single points of failure, and if it goes down, you still need your data to be served. By providing a redundant controller, you can lose a controller and your customers won't even notice. That's why we refer to an HA clustered pair as a single system: the cluster is a unit, a team.

You don't have the same luxury (without massive expense) when you spread your cluster across two datacenters. The reason you geographically locate your SAN system in the same datacenter as your servers is that there's a large amount of traffic going back and forth from the SAN system to the servers. Trying to pump all that traffic through an inter-site link (ISL aka inter-switch link) requires a serious pipe, which is very expensive.

If your SAN system goes down, the DR plan is typically to fail over the clustered servers at the same time as the SAN system, a complex and often risky procedure. By failing both over, you ensure the traffic does not need to travel over the ISL, which would likely create latencies beyond the tolerances of your applications. But a better solution is to make sure your SAN system is redundant in the first place so you don't need to fail over.

This is why NetApp's current MetroCluster implementation falls short: it has 6 single points of failure that would require you to either push all traffic through your ISL or fail EVERYTHING over. That's not, in my opinion, enterprise-class.

Good news though - looks like NetApp might be planning on fixing this to allow a clustered pair at each datacenter.

Tuesday, September 6, 2011

NetApp Experience: Shelf ID

Encountered something cool recently that totally stumped NetApp experts: a DS4243 shelf whose shelf ID had gone crazy. The ID was set to 19 when it should have been set to 11, and the 1 was blinking. The system recognized the ID as 19 and functioned normally, but the shelf would not respond to the shelf-ID selector button that should have allowed me to change it. There was a disk drive missing in slot 4: this turned out to be unrelated as far as I can tell. At the software level, ACP and everything else just saw the ID as 19! Steps I tried:

- Power cycle the shelf (no effect).
- Change shelf ID (Wouldn't respond).
- Reseat the IOM modules (no effect).
- Update firmware (no effect).
- Replace the missing drive (no effect).

Got on the phone with NGS, and at the end of the day there was nothing else we could try. They shipped out a new chassis and we swapped it out, placing the old disks, power supplies, and IOM modules into the new chassis. Set the new chassis's shelf ID and everything worked great!

Details for future reference:
1TB DS4243 with IOM3's hooked up to a 6080 cluster, MPHA. 2 stacks of 2 shelves.

Friday, September 2, 2011

NetApp Training Brain Dump: Snapshots

The concept here is that a snapshot can become as large as the original dataset in the volume (100%). Remember that the space occupied by data in the volume is the sum of the existing LUNs/qtrees and any snapshots that exist in that volume. Empty space in the volume is ignored by snapshots.

Here's the important background idea: WAFL does not update-in-place when existing data is changed. This means that for a normal LUN that has no snapshots, when data changes, it is written to a new location (total space occupied increases) and then the old data is deleted (total space occupied goes back to pre-change levels).

Illustration: If in a LUN with 6GB of data a 4KB block is changed, the sum total of space occupied by data rises to 6GB + 4KB, then back to 6GB as the out of date 4KB block is deleted and reclaimed. WAFL handles this so quickly that your LUN effectively does not increase in size. This is a great advantage for WAFL because update in place can cause data corruption.

This concept is essential to understanding how snapshots work in ONTAP. Let's go back to our 6GB LUN with a 4KB change: WAFL writes the new 4KB data to new, unoccupied space and the snapshot is left occupying the space that would be otherwise deleted. So as data changes, it is not actually the snapshot that is allocated more space, but its existence means that the space that could be reclaimed is now solely assigned to the snapshot. So any data that is only assigned to the snapshot is considered occupied by the snapshot. In this example, the snapshot would be considered to be 4KB in size.
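The accounting described above can be sketched with a toy block map. This is a simplified model and not how WAFL is implemented: the `Volume` class, its methods, and the snapshot name are hypothetical, each write is exactly one 4KB block, and with more than one snapshot a shared block would be counted against each snapshot that holds it.

```python
BLOCK_KB = 4


class Volume:
    """Toy model of a write-anywhere volume with snapshots."""

    def __init__(self):
        self.next_block = 0
        self.active = {}       # logical offset -> physical block id
        self.snapshots = {}    # snapshot name -> set of physical block ids

    def write(self, offset):
        """Write one 4KB block, always to a new physical block."""
        self.active[offset] = self.next_block
        self.next_block += 1
        # The old physical block is no longer referenced by the active
        # file system; it is free unless a snapshot still points at it.

    def snapshot(self, name):
        self.snapshots[name] = set(self.active.values())

    def snapshot_size_kb(self, name):
        """Space held by this snapshot and not by the active file system."""
        live = set(self.active.values())
        return len(self.snapshots[name] - live) * BLOCK_KB

    def used_kb(self):
        """Total space occupied: active blocks plus blocks pinned by snapshots."""
        used = set(self.active.values())
        for blocks in self.snapshots.values():
            used |= blocks
        return len(used) * BLOCK_KB


vol = Volume()
for offset in range(8):        # 8 blocks = 32KB of data
    vol.write(offset)
vol.snapshot("hourly.0")
print(vol.snapshot_size_kb("hourly.0"), vol.used_kb())  # prints 0 32
vol.write(3)                   # change one 4KB block
print(vol.snapshot_size_kb("hourly.0"), vol.used_kb())  # prints 4 36
```

A fresh snapshot costs nothing because it only points at blocks the active file system already holds. It starts to occupy space only once those blocks are overwritten elsewhere.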

If you're a visual learner like me, checking out this diagram will help you picture the concept.

The size taken up by the snapshot increases in concert with the changes to the original LUN: 500MB of changes to the original LUN means that the snapshot will grow from 0 to 500MB in size. For a 20GB volume that has 6GB of data (including LUNs and other snapshots), the next snapshot can grow as large as 6GB, making the sum total of the original data and the new snapshot 12GB.

You can find commands to control snapshots here.