Tuesday, February 23, 2016

Performance Archive Part II

More on performance archiving: it collects data at the highest granularity possible for each statistic, often once per second.  Each dataset covers up to 6 hours, and each node preserves its own set of this performance data.

There is nothing like this rolling, on-box performance data collection in 7-Mode; there you'd have to run Perfstat.

NetApp has tools to analyze the data, but they are not customer-facing.  You provide the case number on the command line and the archive auto-uploads.


You can specify the exact time window you’re looking for:
cs001::> system node autosupport invoke-performance-archive -node cs001-pn01 -start-date "02/22/2016 8:00:00" -end-date "02/22/2016 14:00:00"


There is minimal impact on the NetApp system from running this – archiving already runs in the background and keeps up to 28 days of data.
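Since each invocation ships at most 6 hours of data, a longer incident window means several calls with consecutive time ranges.  Here's a small sketch (my own helper, not a NetApp tool) that splits a window into 6-hour chunks and prints the matching commands, using the same date format as the example above:

```python
from datetime import datetime, timedelta

# Hypothetical helper: split a requested window into <=6-hour chunks,
# one per invoke-performance-archive call.
FMT = "%m/%d/%Y %H:%M:%S"

def archive_windows(start, end, hours=6):
    """Yield (start, end) string pairs covering [start, end]."""
    t0 = datetime.strptime(start, FMT)
    t1 = datetime.strptime(end, FMT)
    step = timedelta(hours=hours)
    while t0 < t1:
        chunk_end = min(t0 + step, t1)
        yield t0.strftime(FMT), chunk_end.strftime(FMT)
        t0 = chunk_end

for s, e in archive_windows("02/22/2016 08:00:00", "02/23/2016 02:00:00"):
    print(f'system node autosupport invoke-performance-archive '
          f'-node cs001-pn01 -start-date "{s}" -end-date "{e}"')
```

An 18-hour window like the one above yields three commands.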

Friday, February 19, 2016

SolidFire

SolidFire is a "make private cloud easy" solution primarily designed for service providers.  It's a "born in OpenStack" all-flash whitebox solution that aims to be stupid-easy to deploy and manage.

The goal for SolidFire is not to be the fastest, the most resilient, or the most feature-rich.  It aims to answer one question, best in class: "How do I easily deploy Storage as a Service?"  You can see this in their design choices:
  1. Because this is a product service providers sell, they're flash only, have required QOS policies, and skip all the management tools, leaving that to OpenStack.
  2. Because they use two copies of everything instead of RAID, they achieve node level resiliency and skip expensive hardware and software, using inline dedupe/compression to recover the space delta.  This also spreads performance requirements across the entire cluster.
  3. Because they expect you'll be deploying a single configuration thousands of times, they support only 1 protocol and have very limited configuration options.
  4. Because this is for a cloud, not a single-purpose array, the cluster (up to 100 nodes) auto-grows when you add a new node and recovers quickly when you lose one.
A few technical details:
  • Platform today is Dell servers.  Now that Dell owns EMC, it'll probably convert to Cisco.
    • 10 drives per node
    • SF2405: 5-10TB and 50k IOPS
    • SF4805: 10-20TB and 50k IOPS
    • SF9605: 20-40TB and 50k IOPS
    • SF9010:  20-40TB and 75k IOPS
  • Features:
    • Inline dedupe and compression
    • For QOS you can set min, max, and burst limits.
    • Mix any node platform
    • You can hot remove nodes
    • iSCSI, FCP (with a gateway device)
    • Native snapshot capability, and can back up to any Amazon Web Services S3- or OpenStack Swift-compatible API.
  • Under the hood:
    • Nodes are connected via 10GbE over your shared network.  Not a private intracluster network.
    • “All connections for a particular LUN presented to storage go back to the primary node for that LUN. IE: multipath doesn't help you weather a failover. They're dependent on long iSCSI timeouts to give them time to fail a node and redirect traffic.”
  • Performance and QOS: http://www.solidfire.com/resources/provision-control-and-change-storage-performance-on-the-fly
  • Node Loss Demo: http://www.solidfire.com/resources/demonstration-of-solidfires-automated-self-healing-ha
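The min/max/burst QoS settings mentioned above work roughly like a credit bucket: a volume throttled below its max banks headroom it can later spend to burst above max.  This is my own toy illustration of those semantics, not SolidFire's implementation (the min setting is a floor guaranteed under contention and isn't modeled here):

```python
# Toy sketch of max/burst QoS semantics: a volume is capped at max IOPS,
# banks unused headroom as credits, and may spend credits to exceed max
# up to the burst ceiling.
def allowed_iops(demand, max_iops, burst_iops, credits):
    """Return (granted_iops, remaining_credits) for one interval."""
    if demand <= max_iops:
        # Under the cap: bank the unused headroom, up to the burst ceiling.
        credits = min(credits + (max_iops - demand), burst_iops)
        return demand, credits
    # Over the cap: spend credits to burst, never beyond burst_iops.
    grant = min(demand, max_iops + credits, burst_iops)
    credits -= grant - max_iops
    return grant, credits

# A volume capped at 2,000 IOPS with 1,500 banked credits can briefly
# serve a 3,000 IOPS demand.
grant, credits = allowed_iops(demand=3000, max_iops=2000,
                              burst_iops=4000, credits=1500)
print(grant, credits)
```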
SolidFire wins Gold in the Storage magazine/SearchStorage.com 2015 Products of the Year, Storage Systems: All-Flash Systems category.  http://searchsolidstatestorage.techtarget.com/feature/SolidFire-SF9605

CDOT Tip: Performance Troubleshooting

NetApp is engineering simpler, more elegant tools for our clients to manage their large technology deployments.  We’ve found recently that clients are relatively unaware of one tool that can make your life much easier when you are investigating a performance issue: Performance Archives.

Performance Archives has existed in ONTAP for a long time, but beginning with 8.3 (released November 2014), the payload was updated to specifically enable diagnostic use cases, trending use cases, etc.  In 8.3, there is a command, “system node autosupport invoke-performance-archive”, which allows customers (with HTTPS enabled) to ship back up to 6 hours' worth of data at a time, collected at a much higher resolution than off-box Perfstat ever did (per-second for many counters) AND allows you to “go back in time” up to 28 days, depending on customer configuration.

The tool we recommended pre-8.3, Perfstat, will continue to function through 8.4 per current plan, but we recognize it is a post-failure collection mechanism, which is not ideal.  In other words, after you run into a performance issue, you have to set up Perfstat to run and wait for that issue to recur.  Performance Archives gives you the ability to instantly look back several hours and catch the problem in the act.

We’re also making big improvements to Performance AutoSupports: we are focused on efficiently streaming real-time performance information to our cloud and making it easier to access this content within the client-facing NetApp ASUP infrastructure.  This will allow our customers and NetApp support engineers to do trending, analytics, diagnostics and more.


And of course OnCommand Performance Manager remains your go-to tool for retaining, graphing, and applying machine-learning analysis to performance data across your entire NetApp footprint.

Read more here: https://kb.netapp.com/support/login/p_login.jsp?page=content&id=3014366&locale=en_US

Thursday, February 18, 2016

Cloud ONTAP

  We’ve all seen the cloud coming for years, but now suddenly our clients are finding hyperscalers compelling and maybe essential.  We encourage our clients to use the public cloud wherever it’s optimal and we’d love to partner with customers as they make the transition.  One of the ways we can help is with Cloud ONTAP.

  Retaining the security, standards, and integrity of your data as you start to use the cloud isn't easy.  If your data is already on NetApp, it's a cinch to deploy NetApp's OS in the cloud and replicate the data over, solving all those problems. 

Cloud ONTAP is purchasable in two forms:
  • Pay as you go – Buy directly from AWS and pay per hour.
  • Bring Your Own License: 6- or 12-month “everything included” license quoted by NetApp.
    • I heard NetApp can put together a 48 month license quote if needed.
    • Includes up to 368TB of capacity.
  • Neither of these options include the cost of AWS, which depends on several factors, including:
    • Which instance size you choose (M4.4XL, M4.2XL, M3.XL, M3.2XL, R3.XL, R3.2XL, etc.).  This is a menu of RAM/CPU combos.*
    • If you’re using NetApp aggr encryption, you can only choose M4.2XL
    • What type of disk you choose (SSD or HDD).
    • How much capacity you need.
    • How utilized you expect the CPU to be.
    • How much data you expect to be transferred out of the instance.  Transfer into AWS is free.

You can use the AWS cost calculator to estimate the cost: http://calculator.s3.amazonaws.com/index.html
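To show how those factors combine, here's a back-of-the-envelope monthly estimate.  Every rate below is an assumed placeholder, not a real AWS price – use the calculator above for actual numbers:

```python
# Back-of-the-envelope monthly AWS cost sketch.  All rates are
# hypothetical examples; check the AWS calculator for real pricing.
hours_per_month = 730
instance_rate = 0.55      # assumed $/hr for the chosen instance size
disk_gb = 2000            # provisioned capacity in GB
disk_rate = 0.10          # assumed $/GB-month for the disk type
egress_gb = 500           # data transferred out (transfer in is free)
egress_rate = 0.09        # assumed $/GB out

monthly = (hours_per_month * instance_rate
           + disk_gb * disk_rate
           + egress_gb * egress_rate)
print(f"${monthly:,.2f}/month")   # Cloud ONTAP license not included
```

Note the instance-hours term often dominates, which is why instance size (and how utilized you expect the CPU to be) matters so much.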

*Testing indicates an M3.2XL instance caps out at ~10k IOPS (100% read) and 1.5k IOPS (100% write) at 20ms latency.  M4.4XL should accomplish roughly 2x the performance of M3.2XL.

  In most situations, using storage in the cloud is going to cost you 3+ times the acquisition cost of the same on-prem storage array over a 4-year life.  After accounting for datacenter costs and hardware management, though, you may find the TCO comparable.  NetApp offers a free cloud workshop to help you identify which workloads are great cloud candidates and sort through your real requirements and costs.  Feel free to reach out to your friendly neighborhood NetApp engineer or myself if you're interested!

All technical information is as of CDOT 8.3.2; CDOT 8.4 is expected to bring big improvements for Cloud ONTAP.