Wednesday, January 13, 2016

Top of Rack Flash and FaME

One developing topic is Flash as Memory Extension, or FaME.  With flash getting cheaper and CPUs getting faster, there's an architectural bottleneck to solve: a huge performance gap between DRAM and the SAN.

FaME tries to solve this by accessing SSDs in a way that mimics access to RAM, skipping the SCSI stack and driving down latency.  This begins to close the DRAM-SAN gap, achieving latencies in the 500-900ns range.  There are a whole host of ways to accomplish this.
  • NVMe is one, essentially an SSD that plugs straight into a PCIe slot.  Pop it in a server and it acts, loosely speaking, like DRAM.  The disadvantage here is you've re-introduced the problems SAN/NAS was introduced to solve: stranded capacity, lack of data protection features (snapshots, replication), and you'll need to come up with a way to make this NVMe available to multiple nodes for parallelization and redundancy.  Think Fusion-IO, which also required significant re-coding of applications.
  • RDMA over InfiniBand has port-to-port latencies ~100ns.
  • RoCE (RDMA over Converged Ethernet) has port-to-port latencies ~230ns.  RoCE v1 is a link-layer protocol; RoCE v2 runs over UDP/IP and so is routable, however RoCE v2 is not supported by many OSes yet.
  • iWARP is a protocol that carries RDMA over a stateful transport like TCP.
  • Memcached is an open-source way for your servers to use other servers as a DRAM extension.  I'm a bit fuzzy on whether it simply uses the second server as a place to put part of your current working set or if it offloads portions of the computation as well.  In any case, here's a good explanation.
  • HDFS, key-value store semantics and other protocols.  This may be the smartest way to do things: just let the application speak directly to a storage array the same way it would speak to RAM.
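
To make the "other servers as a DRAM extension" idea from the list above concrete, here is a minimal sketch of a read-through lookup across three tiers.  Everything in it is hypothetical: the TieredStore class and its names are made up for illustration, and plain Python dicts stand in for local DRAM, a remote server's DRAM and the SAN/NAS.  It only stores and fetches values (it does no remote computation), and it is not how Memcached or any particular product is implemented.

```python
class TieredStore:
    """Toy read-through lookup: local dict -> 'remote' dict -> backing store.

    Hypothetical illustration only; the dicts stand in for real tiers.
    """

    def __init__(self, backing, local_capacity=2):
        self.local = {}          # stands in for this server's DRAM
        self.remote = {}         # stands in for another server's DRAM
        self.backing = backing   # stands in for the SAN/NAS
        self.local_capacity = local_capacity
        self.hits = {"local": 0, "remote": 0, "backing": 0}

    def get(self, key):
        # Fastest tier first.
        if key in self.local:
            self.hits["local"] += 1
            return self.local[key]
        # Next, the memory extension on the other server.
        if key in self.remote:
            self.hits["remote"] += 1
            value = self.remote[key]
        else:
            # Slowest path: go to the backing store and remember the value remotely.
            self.hits["backing"] += 1
            value = self.backing[key]
            self.remote[key] = value
        self._promote(key, value)
        return value

    def _promote(self, key, value):
        # When local memory is full, demote the oldest-inserted entry to the
        # remote tier (dicts preserve insertion order).
        if len(self.local) >= self.local_capacity:
            old_key = next(iter(self.local))
            self.remote[old_key] = self.local.pop(old_key)
        self.local[key] = value


store = TieredStore({"a": 1, "b": 2, "c": 3}, local_capacity=2)
for k in ["a", "b", "c", "a"]:
    store.get(k)
print(store.hits)
```

The second read of "a" is served from the remote tier rather than the backing store, which is the whole point: the working set can outgrow local DRAM without every miss paying SAN latency.
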
Currently the fastest all-flash SAN/NAS arrays have latencies in the 100,000-200,000ns range.  This is miles away from DRAM, which is in the 6-20ns range.  FaME is aiming for 900ns.  Architectures like HANA and Spark try to solve this by putting the entire workload in DRAM, which is expensive and means that a power outage requires a long process of re-loading the data from disk into DRAM.
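
To put those numbers side by side, here is a quick back-of-the-envelope calculation using only the latency figures quoted above.  The dictionary and function names are just labels chosen for this sketch.

```python
# Latency figures quoted in the post, in nanoseconds: (low, high).
LATENCY_NS = {
    "DRAM": (6, 20),
    "FaME target": (900, 900),
    "All-flash SAN/NAS": (100_000, 200_000),
}


def slowdown_vs_dram(tier):
    """Return (best_case, worst_case) slowdown of `tier` relative to DRAM."""
    dram_lo, dram_hi = LATENCY_NS["DRAM"]
    lo, hi = LATENCY_NS[tier]
    # Best case: the tier's fastest access vs the slowest DRAM access.
    # Worst case: the tier's slowest access vs the fastest DRAM access.
    return lo / dram_hi, hi / dram_lo


for tier in ("FaME target", "All-flash SAN/NAS"):
    best, worst = slowdown_vs_dram(tier)
    print(f"{tier}: {best:,.0f}x to {worst:,.0f}x DRAM latency")
```

On these figures, FaME at 900ns would still be 45x to 150x slower than DRAM, but an all-flash array is roughly 5,000x to 33,000x slower, so FaME would close most of the gap.
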

Looks like FaME will be a good price/performance solution until we develop a super-cheap static RAM.
