Wednesday, September 7, 2011

NetApp Insights: MetroCluster

My main criticism of NetApp's MetroCluster implementation is the same as this guy's: it has single points of failure.

Let's rewind a bit.  NetApp has a product called a Fabric MetroCluster, in which you pretty much pick up one controller out of your HA pair and move it to another datacenter (I'm simplifying things).  It's a good implementation in that it spreads the reliability of a single system out across two datacenters and replicates in real time.  It's a bad implementation in that it's still a single system.

Everything can fail, so in SAN, the name of the game is redundancy.  This is why customers buy TWO controllers when they purchase an HA system, even though both controllers are going into the same datacenter: each controller has ~6 single points of failure, and if one controller goes down, you still need your data to be served.  With a redundant controller, you can lose one and your customers won't even notice.  That's why we refer to an HA clustered pair as a single system: the cluster is a unit, a team.
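To put rough numbers on that, here's a back-of-the-envelope sketch.  The 99.9% per-component uptime is a figure I made up for illustration, not a NetApp spec, and the math assumes failures are independent, which real hardware doesn't guarantee.  The function names are mine too.

```python
def series_availability(component_availabilities):
    """Availability of a unit that needs ALL of its components working."""
    result = 1.0
    for availability in component_availabilities:
        result *= availability
    return result


def parallel_availability(unit_availability, units):
    """Availability of a group that needs only ONE of its units working.

    Assumes the units fail independently of each other.
    """
    return 1.0 - (1.0 - unit_availability) ** units


# Assumed numbers, for illustration only.
COMPONENT_AVAILABILITY = 0.999  # hypothetical uptime of each single point of failure
SPOF_COUNT = 6                  # the ~6 single points of failure per controller

single_controller = series_availability([COMPONENT_AVAILABILITY] * SPOF_COUNT)
ha_pair = parallel_availability(single_controller, 2)

print(f"single controller: {single_controller:.6f}")
print(f"HA pair:           {ha_pair:.6f}")
```

Under those assumptions, one controller on its own is noticeably less available than any one of its parts, and pairing two of them wins back far more than was lost.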

You don't have the same luxury (without massive expense) when you spread your cluster across two datacenters.  The reason you put your SAN system in the same datacenter as your servers is that there's a large amount of traffic going back and forth between the two.  Trying to pump all that traffic through an inter-site link (ISL, a.k.a. inter-switch link) requires a serious pipe, which is very expensive.

If your SAN system goes down, the DR plan is typically to fail over the clustered servers at the same time as the SAN system, a complex and often risky procedure.  By failing both over, you ensure the traffic does not need to travel over the ISL, which would likely create latencies beyond the tolerances of your applications.  But a better solution is to make sure your SAN system is redundant in the first place so you don't need to fail over.
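For a feel of the latency side, here's a minimal sketch of the propagation delay an ISL adds to each I/O.  It assumes light covers a kilometer of fiber in roughly 5 microseconds and that every I/O costs at least one round trip; the 100 km distance is just an example, and switch hops and queuing (which I'm ignoring) only make it worse.

```python
# Rough rule of thumb: light in optical fiber travels about 200,000 km/s,
# i.e. roughly 5 microseconds per kilometer, one way.
FIBER_DELAY_US_PER_KM = 5.0


def isl_round_trip_ms(distance_km):
    """Propagation delay added by one round trip over the ISL, in milliseconds.

    Counts fiber propagation only; switch and queuing delays are ignored.
    """
    one_way_us = distance_km * FIBER_DELAY_US_PER_KM
    return (one_way_us * 2) / 1000.0


# Hypothetical distance between the two datacenters.
distance_km = 100
print(f"{distance_km} km adds about {isl_round_trip_ms(distance_km)} ms per round trip")
```

That's a floor, not an estimate: it gets added to every remote I/O on top of whatever the array itself takes to respond.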

This is why NetApp's current MetroCluster implementation falls short: it has 6 single points of failure that would require you to either push all traffic through your ISL or fail EVERYTHING over.  That's not, in my opinion, enterprise-class.

Good news though - looks like NetApp might be planning on fixing this to allow a clustered pair at each datacenter.  
