Sunday, January 17, 2016

Memory as a Service and Apache Spark

Apache Spark is quickly crowding out MapReduce as the framework for orchestrating data analytics on compute clusters like Hadoop.  It's much faster (up to ~100x for in-memory workloads), requires less disk storage and throughput, is more resilient, and is easier to code for.
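
As one illustration of "easier to code for" (a standalone sketch, not from the original post; the input and output paths are made up): the classic word count, which takes a page of mapper/reducer boilerplate in MapReduce, is a few lines in Spark's Scala API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Word count -- the canonical MapReduce example -- in a few lines of Spark.
// The HDFS paths here are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

val counts = sc.textFile("hdfs:///data/corpus")
  .flatMap(_.split("\\s+"))        // split lines into words
  .map(word => (word, 1))          // pair each word with a count of 1
  .reduceByKey(_ + _)              // sum the counts per word

counts.saveAsTextFile("hdfs:///data/corpus-counts")
```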

MapReduce gets its speed (through data locality) and its resiliency from HDFS, which writes three copies of each block to three different nodes.  Each node writes its copy to disk before the work starts, so you're tripling the disk writes and tripling the space required to do a job.  One way to solve this is to use direct-attach storage arrays: you can connect 4+ Hadoop nodes to one array and then keep only two copies of each block, relying on the array's RAID for the second layer of protection.  This also allows you to scale capacity and performance as needed.  Data storage companies see this architecture as their way to contribute to, and sell into, the Big Data industry.
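
As a rough sketch of that tradeoff (the path and numbers are hypothetical): the replication factor is just an HDFS setting, so a cluster whose datanodes sit on a RAID-protected array could drop from three copies to two like this.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumption: the datanodes' local disks are backed by a RAID-protected
// direct-attach array, so two HDFS replicas are enough.
val conf = new Configuration()
conf.set("dfs.replication", "2")   // applies to files created with this conf

val fs = FileSystem.get(conf)
// Or lower the replication factor of an existing dataset in place:
fs.setReplication(new Path("/data/clickstream"), 2.toShort)
```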

But here comes Spark, whose core abstraction is the "Resilient Distributed Dataset" (RDD).  An RDD tracks the lineage of transformations used to build it, so if a node goes down, the missing partitions can simply be recomputed rather than restored from replicated copies -- similar in spirit to erasure coding's "rebuild what's lost" approach.  Since that layer of resiliency is built into the dataset itself, writing every intermediate result to disk isn't needed, and the work can be done in RAM.
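
A minimal sketch of what that looks like in Spark's Scala API (the file path and logic are illustrative): the chain of transformations below is the lineage, caching keeps the result in RAM, and any partition lost with a failed node is recomputed from that lineage instead of being re-read from replicated disk copies.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: the input path and transformations are made up.
val sc = new SparkContext(new SparkConf().setAppName("lineage-demo"))

val events = sc.textFile("hdfs:///data/events")     // stage 1 of the lineage
val errors = events.filter(_.contains("ERROR"))     // stage 2
  .map(_.split("\t")(0))                            // stage 3

errors.cache()     // keep the partitions in RAM
errors.count()     // if a node dies mid-job, Spark replays the filter/map
                   // lineage on the lost partitions only
```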

This basically eliminates today's value proposition for data storage companies.  But it does require a lot more RAM, which is expensive.  I'm starting to see potential for a developing "flash as memory extension" market: connect an all-flash or 3D XPoint array to several nodes and you have a slower but much cheaper tier to run Spark against.  For use cases that don't need blazing speed and do have cost constraints, that could work well.
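
One way this could look in practice (a sketch; the mount point is an assumption, standing in for a flash-array volume): Spark's storage levels already let cached data spill past RAM, so pointing the scratch directories at a flash-backed device effectively turns the array into cheap, slow memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: /mnt/flash_array is a mount from the shared all-flash
// (or 3D XPoint) array.
val conf = new SparkConf()
  .setAppName("flash-tier-demo")
  .set("spark.local.dir", "/mnt/flash_array/spark-spill")  // shuffle/spill on flash

val sc = new SparkContext(conf)

val dataset = sc.textFile("hdfs:///data/large-dataset")

// MEMORY_AND_DISK_SER keeps what fits in RAM and pushes the rest to the
// flash-backed local dirs -- slower than pure RAM, but far cheaper per GB.
dataset.persist(StorageLevel.MEMORY_AND_DISK_SER)
dataset.count()
```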

Memory as a shared resource... it looks like we're heading toward a Memory as a Service model, similar to RAMCloud!
