Wednesday, November 30, 2011

AT&T and iPhone

Since my AT&T contract is up, I'm considering upgrading from my iPhone 3GS.  The rep told me I could save money on my plan by moving to a 2GB data limit ($25 instead of $45).  I need to keep text and voice unlimited - my experience is that you'll end up going over and spending that money anyway, so you might as well use it.

Last time I checked, I never used anywhere near 2GB/month... so I checked again, and here's what I found:

[Chart of my monthly data usage.  Credit: Me and AT&T]
My data usage on the same 3GS I've always had has shot through the roof.  That may be my job (switched in April), that may be a change in my behavior, that may be software eating up data passively.  I don't know - I need to break this down further.

I went to their website | wireless | usage & recent activity | "more details" under "Data" | selected "Filter Data Usage by: Internet/MEdia Net" and then sorted by transmission size.  I found 0.833GB in November from 32 entries over 18 days of usage.  I have a great wireless connection at home, so I'll try to utilize it better for the next 12 days and see what the results are.

Wednesday, November 16, 2011

NetApp Insights: Sequential Reads

In the spirit of customizing NetApp technology for your workload and needs, one thing to talk about is sequential reads. The vast majority of customers don't have to worry about this because WAFL is designed to meet 99% of normal use cases, but in some unique situations it pays to optimize.


 For example, if your workload is 20 minutes of random writes followed by 20 minutes of sequential reads, and you don’t use any reallocate technology, you will likely see declining performance on your NetApp system over time. Who has this workload?  Basically nobody.  But for less extreme cases, you may find performance boosts through customizing your volume for sequential reads like this guy, who got a 6% throughput increase.  


Here are a few steps you can take (a command sketch follows the list):

  • Set the volume option read_realloc.
  • Set option wafl.optimize-write-once=false.
  • Schedule a regular reallocate scan.
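
Here's a minimal sketch of those three steps on a 7-Mode console.  Assume myvol is a placeholder volume name, and verify the option spelling and the reallocate schedule format against your ONTAP release:

    filer> vol options myvol read_realloc on              # or space_optimized if Snapshot space is a concern
    filer> options wafl.optimize-write-once false         # option spelling/value format varies by release
    filer> reallocate on                                  # enable the reallocation engine
    filer> reallocate schedule -s "0 23 * 6" /vol/myvol   # minute hour day-of-month day-of-week: Saturdays 11pm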

About read reallocation (Quote)


    For workloads that perform a mixture of random writes and large and multiple sequential reads, read reallocation improves the file's layout and sequential read performance. When you enable read reallocation, Data ONTAP analyzes the parts of the file that are read sequentially. If the associated blocks are not already largely contiguous, Data ONTAP updates the file's layout by rewriting those blocks to another location on disk. The rewrite improves the file's layout, thus improving the sequential read performance the next time that section of the file is read. However, read reallocation might result in a higher load on the storage system. 


   Also, unless you set vol options vol-name read_realloc to space_optimized, read reallocation might result in more storage use if Snapshot copies are used. If you want to enable read reallocation but storage space is a concern, you can enable read reallocation on FlexVol volumes by setting vol options vol-name read_realloc to space_optimized (instead of on). Setting the option to space_optimized conserves space but results in degraded read performance through the Snapshot copies. Therefore, if fast read performance through Snapshot copies is a high priority to you, do not use space_optimized. 


   Read reallocation might conflict with deduplication by adding new blocks that were previously consolidated during the deduplication process. A deduplication scan might also consolidate blocks that were previously rearranged by the read reallocation process, thus separating chains of blocks that were sequentially laid out on disk. Therefore, since read reallocation does not predictably improve the file layout and the sequential read performance when used on deduplicated volumes, performing read reallocation on deduplicated volumes is not recommended. Instead, for files to benefit from read reallocation, they should be stored on volumes that are not enabled for deduplication. The read reallocation function is not supported on FlexCache volumes. 


If file fragmentation is a concern, enable the read reallocation function on the original server volume.  (/Quote)


More reading here.  Interesting quote from this additional reading: "For write-intensive, high-performance workloads we recommend leaving available approximately 10% of the usable space for this optimization process."  Luckily that 10% includes any white space, including unused space carved out for thick-provisioned LUNs.
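
As a quick sanity check on that headroom, df -A shows aggregate-level usage in 7-Mode (aggr1 is a placeholder name):

    filer> df -A aggr1    # keep "avail" at roughly 10% of total for write-heavy workloads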

Tuesday, November 15, 2011

NetApp Insights: RAID Group and Aggregate Setup

Since I'm always looking out for the underdog, I'm aware that many customers can't afford to buy 20 shelves of disks at a time.  And even though they are smaller customers, getting in on the ground floor with great technology and earning loyalty from a customer is a big priority for any smart company.

If you've just invested in a small NetApp deployment, here are the questions you should be asking yourself:
1.  How can I get the most usable space out of my investment?
2.  How can I ensure full redundancy and data protection?
3.  What configuration will squeeze the most performance out of this system?
4.  Where are my performance bottlenecks today in #3's configuration?
5.  How long will it take to saturate that bottleneck, and what will be my plans to expand?

I'm going to discuss the configuration options that will both maximize your initial investment and set you up for success in the long term.  Be aware that this is a textbook study of tradeoffs between stability, scalability, space, and performance.

A few basics: 
1.  Each controller needs to put its root volume somewhere.  Where you put it makes a big difference when working with <100 disks.
       a. For an enterprise user, the recommended configuration is to create a 3-disk aggregate whose only responsibility is to hold this root volume, which requires no more than a few GB of space.  If you only purchased 24 or 48 disks, you could understandably consider this pretty wasteful.  (A command sketch for this setup follows this list.)
     The rationale behind this setup is that you isolate these three OS disks from other IO, making your base OS more secure.  More importantly, if you ever have to recover the OS due to corruption, the consistency check only has to run against 3 disks rather than several dozen.  Lastly, if your root vol ever runs out of space, expect a system panic.  By creating a dedicated 3-disk aggregate, you protect it from being crowded out by expanding snapshots.
       b.  For a smaller deployment, another option would be to create an aggregate spanning those 24-48 disks and have the root volume reside there.  This is a valid option that is taken by many customers.

2.  Each RAID-DP RAID group (RG) dedicates 2 disks to parity.  Consider this when looking at space utilization.

3.  You typically want to avoid having a mixed-ownership loop/stack.  What this means is that within a stack of shelves, you should do your best to have the disks owned by a single controller.  This is not always achievable right away, but could be after an expansion.

4.  Before creating a RG plan for any SAN, read TR-3437 and understand it thoroughly.  It covers everything you need to know.
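
For the dedicated root aggregate described in 1a, here's a hedged sketch of the 7-Mode commands (aggr_root and vol0_new are placeholder names; an actual root-volume migration also involves copying /etc from the old root volume before rebooting):

    filer> aggr create aggr_root -t raid_dp 3    # 3 disks: 1 data + 2 parity
    filer> vol create vol0_new aggr_root 160g    # size per your platform's minimum root volume size
    filer> vol options vol0_new root             # designates the new root; takes effect on reboot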

Scenario:  You purchase a FAS3100 series cluster with 2 shelves of 450GB SAS drives, no Flash Cache.  Here are a few of the options available.  Note: these IOPS numbers are not meant to be accurate, just to illustrate the relative merits of each configuration.
1.  Create two stacks of 1 shelf.  Create two RGs per shelf of 11 and 12 disks each, combined in one aggregate.  Leave one spare disk per shelf.  Results:
Usable space: 12.89TB
IOPS the disks can support: 175 IOPS/disk * 19 data disks = 3325 IOPS per controller.
Advantages: Easily expandable (existing RGs will be expanded onto new shelves, improving stability), full controller CPU utilization available, volumes on each controller are shielded in terms of performance from volumes on the other controller.
Disadvantages: Lower usable space, lower IOPS available for any one volume, no RG dedicated to the root volume.
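
As a hedged sketch, option 1 maps to one aggr create per controller, letting raidsize do the RG split (aggr1 is a placeholder name, and your disk-list syntax may differ):

    fas-a> aggr create aggr1 -t raid_dp -r 12 23    # 23 disks with raidsize 12 -> RGs of 12 and 11
    fas-a> aggr status -r aggr1                     # verify the resulting RG layout (repeat on the partner)

Bumping -r to 23 with the same 23 disks would give you option 2's single large RG instead.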

2.  Create two stacks of 1 shelf.  Create 1 RG per shelf of 23 disks each, combined in one aggregate.  Leave one spare disk per shelf.  Results:
Usable space: 14.26TB
IOPS the disks can support: 175 IOPS/disk * 21 data disks = 3675 IOPS per controller.
Advantages:  Higher usable space, full controller CPU utilization available, volumes on each controller are shielded in terms of performance from volumes on the other controller.
Disadvantages: Lower IOPS available for any one volume than the active/passive options, no RG dedicated to the root volume, lower data protection because of the large RG size, lower stability when expanded because the entire RG is located in one shelf.

3.  Create 1 stack of 2 shelves for an active/passive config.  Create 4 RGs (14, 14, 15, 3), with the large RGs combined in one aggregate and the 3-disk RG in another.  Leave one spare disk per shelf.  Results:
Usable space: 12.39TB
IOPS the disks can support: 175 IOPS/disk * 37 data disks = 6475 IOPS on the active controller.
Advantages:  High IOPS available for any one volume, expandable (existing RGs will be expanded onto new shelves, improving stability).
Disadvantages: Lower usable space, no RG dedicated to the active controller's root volume, only half the CPU power of the cluster used.

4.  Create 1 stack of 2 shelves for an active/passive config.  Create 3 RGs (22, 21, 3), with the two largest in one aggregate and the 3-disk RG in another.  Leave one spare disk per shelf.  Results:
Usable space: 13.08TB
IOPS the disks can support: 175 IOPS/disk * 39 data disks = 6825 IOPS on the active controller.
Advantages:  Highest IOPS available for any one volume.
Disadvantages:  Lower usable space, no RG dedicated to the root volume on the active controller, lower data protection because of the large RG size, lower stability when expanded because the RGs are confined to the original two shelves, only half the CPU power of the cluster used.

5.  Create 1 stack of 2 shelves for an active/passive config.  Create 4 RGs (20, 20, 3, 3), with the two largest in one aggregate and two root aggregates of 3 disks each.  Leave one spare disk per shelf.  Results:
Usable space: 11.89TB
IOPS the disks can support: 175 IOPS/disk * 36 data disks = 6300 IOPS on the active controller.
Advantages:  RGs dedicated to the root volumes, high IOPS for the active controller.
Disadvantages:  Lower usable space, lower data protection because of the large RG size, lower stability when expanded because the RGs are confined to the original two shelves, only half the CPU power of the cluster used.

Here's the breakdown:

Option   Usable space   IOPS available        CPU utilization
1        12.89TB        3325 per controller   both controllers
2        14.26TB        3675 per controller   both controllers
3        12.39TB        6475 (active only)    one controller
4        13.08TB        6825 (active only)    one controller
5        11.89TB        6300 (active only)    one controller

(Credit: Me!)
This post is long enough already so I'll keep the conclusion short: understand the requirements of your application, and use the examples above to help customize NetApp systems to meet those specs at a low price.

NetApp Insights: FlashCache Doesn't Cache Writes

Below: a brilliant article on why NetApp designed its cache product to accelerate reads only.

http://communities.netapp.com/community/netapp-blogs/efficiency/blog/2011/02/08/flash-cache-doesnt-cache-writes--why

Update: Since writes in ONTAP require several reads to accomplish (updating bitmaps, etc.), Flash Cache can indirectly speed up writes.  But since that metadata is often already in system memory, the actual benefit is complicated and depends on the situation.

Friday, November 11, 2011

BJJ Tournament Progress

One of the tough parts of competing isn't cutting weight: it's how cutting weight affects your training.


I'm down to 204.2 lbs this morning, and when you take in fewer calories than you burn, your body feels tired and achy and sore.  I wonder if it would be beneficial to add 300 calories of food and also 300 calories of exercise?  Is it the calorie deficit that matters in how your body feels, or the overall nutrition?


Either way, at 11% body fat I certainly have more pounds I can cut comfortably without worrying too much about the impact on my health.  It seems I can lose half a pound a day without pushing myself too hard.


Since I can weigh in the day before (Dec 9th at 6pm), I can lose 3 lbs of water weight and then gain it back overnight - which means I really only need to be 203 lbs in the morning!


In the meantime, I've made a lot of progress escaping side control.  I need to spend time working on my guillotines and d'arces next.  Booyah!

Wednesday, November 9, 2011

NetApp Experience: MetroCluster Disk Fail

I'm working on a case for a MetroCluster right now.  The situation started when the customer, doing power maintenance, shut off one of the two PDUs in the rack.  In a MetroCluster there are redundant fibre channel switches between the system and the shelves, but in this case each switch had only one power supply, and the two switches were split across the two PDUs.  This meant that one of the two switches went offline during the work.


The system failed 9 disks, and all 9 of them were being addressed over the switch that went down.  NGS found iSCSI errors on the switch that stayed up during that time.  The only firmware that is backrev'd is the ESH firmware, so we're going to have to dig deeper for a solution.


Those 9 disks failing caused a RAID Group to fail, which caused an aggregate to fail.  We got the system back up and running by re-seating each disk slowly, with 90 seconds between each action.
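
For context, here's a hedged sketch of the kind of 7-Mode checks we were running between re-seats - nothing exotic, just watching the disks come back:

    filer> disk show -v      # confirm each re-seated disk reappears with the correct ownership
    filer> aggr status -r    # watch the RAID group / aggregate state as disks return
    filer> sysconfig -a      # double-check shelf module and firmware versions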


Update:  Here's what NGS had to say. (Quote)

The ports on the brocade switches are not locked:
>>> [-1][01] CRITICAL [01][01][00]:switchname: Port[4] has loop port NOT LOCKED
The ports not being locked can cause a number of instability issues and is the most likely cause of the issue seen.  The information on how to lock these can be found in the following document: http://media.netapp.com/documents/tr-3548.pdf

 "Not enough power supplies are present...to satisfy disk drive and shelf power requirements." 
The logs are erroneous; there have been BURTs opened to correct this warning.  But one PSU should not cause the disk shelf any issues, other than taking longer to get the disks spun up, since it will do this in increments.

"Cluster monitor: takeover of eg-nascsi-a02 disabled (unsynchronized log)"
"Cluster Interconnect link 0 is DOWN"
It wouldn't surprise me if the syncmirror lost sync during this period based on the switch issues they experienced.  The ports not being locked can cause a large number of unusual errors.
(End Quote)


Making sense of it: Brocade ports are categorized as E, F, L, G, etc.  An L port is a loop port, which means the switch will only initiate loop traffic on it.  Locking a port as an L port means the switch won't try to negotiate it into a point-to-point (F port) relationship.  Directly from Brocade's documentation:
[Excerpt on port types.  Credit: Brocade Fabric OS Reference 2.6]
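
For reference, here's a hedged sketch of what locking a loop port looks like on the switch side (syntax from the FOS 2.6 era; verify against your switch's documentation):

    switch:admin> portcfglport 4, 1    # lock port 4 as an L_Port (1 = locked)
    switch:admin> portcfgshow          # confirm "Locked L_Port" shows ON for the shelf ports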


Here's a case of a NetApp customer working through this issue.

Predictions

Very soon, Facebook will become automatic.  Your phone will keep track of the people you've encountered through the day in chronological order, and prompt you to friend someone if you hang out with them for a while or get very close to them.

Monday, November 7, 2011

NetApp Insights: Ifgrp

A few cool points on ifgrps (a command sketch follows the list):

-  Can you add a new port to an existing ifgrp?  Yes!  If the new port is not connected to anything, though, it must be configured down first.  It will come up in the ifgrp once you've plugged it in: ifgrp add <ifgrp_name> <interface>.
-  Can you remove a port from an existing ifgrp?  Yes!  Not live, though.  You must first configure the ifgrp down, and then you can ifgrp delete <ifgrp_name> <interface>.
-  How many switches can an ifgrp span?  I don't know.
-  Can you add a second IP to an existing vFiler?  Yes!  But each vFiler can live in only one IPspace.
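
Putting those together, here's a hedged sketch of the add/remove flow on a 7-Mode console (vif1 and e0c are placeholder names):

    filer> ifconfig e0c down       # new, unplugged port must be configured down first
    filer> ifgrp add vif1 e0c      # port joins the ifgrp once the cable is in
    filer> ifconfig vif1 down      # removal can't be done live
    filer> ifgrp delete vif1 e0c
    filer> ifconfig vif1 up        # bring the ifgrp back up afterward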