Implementing the butterfly architecture efficiently requires a fast storage engine for exchanging data across data pipelines, streaming ingestion, and analytics. Optimized implementations of both immutable and mutable dataframes are needed to support fast batch-oriented queries as well as fast transactions, so that multiple workloads can co-exist on a single system.
Traditional disk-based storage systems make this unification extremely difficult. However, the emergence of NVMe-connected Flash, NVDIMMs (non-volatile dual in-line memory modules), and a new class of persistent memory (SCM, or Storage Class Memory) provides a storage medium well suited to letting high-throughput, scan-oriented workloads co-exist with low-latency random access workloads. The table below characterizes the current and projected performance of various storage layers, along with their approximate cost.
Price/Performance Comparison of Storage Technologies
As the table above shows, the current generation of DDR4 DRAM is the most cost-efficient for throughput-oriented workloads (bandwidth per dollar of cost), and the emerging Storage Class Memory (SCM) is the most cost-effective for random access workloads (input/output operations per second, or IOPS, per dollar of cost). Of course, media cost is not the only consideration when building systems; storage density and power consumption also need to be considered. Since the new SCM promises much higher density and much lower power consumption than DRAM, it has the potential to become the primary storage layer for a fast unified data platform.
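The two cost-efficiency metrics used in this comparison, bandwidth per dollar and IOPS per dollar, can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `StorageTier` type, the helper functions, and the tier names are hypothetical, and the numbers are arbitrary placeholders, not the values from the table.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class StorageTier:
    """One storage layer; all fields are illustrative assumptions."""
    name: str
    bandwidth_gbps: float  # sustained bandwidth in GB/s
    iops: float            # random I/O operations per second
    cost_usd: float        # media cost for the capacity being compared


def bandwidth_per_dollar(tier: StorageTier) -> float:
    """Throughput-oriented metric: bandwidth divided by cost."""
    return tier.bandwidth_gbps / tier.cost_usd


def iops_per_dollar(tier: StorageTier) -> float:
    """Random-access metric: IOPS divided by cost."""
    return tier.iops / tier.cost_usd


def best_tier(tiers: Sequence[StorageTier],
              metric: Callable[[StorageTier], float]) -> StorageTier:
    """Return the tier that scores highest on the given metric."""
    return max(tiers, key=metric)


# Placeholder figures only; they do NOT reproduce the table in this post.
tiers = [
    StorageTier("tier_a", bandwidth_gbps=10.0, iops=1_000, cost_usd=100.0),
    StorageTier("tier_b", bandwidth_gbps=2.0, iops=50_000, cost_usd=50.0),
    StorageTier("tier_c", bandwidth_gbps=0.2, iops=200, cost_usd=10.0),
]

throughput_winner = best_tier(tiers, bandwidth_per_dollar)
random_access_winner = best_tier(tiers, iops_per_dollar)
print(throughput_winner.name, random_access_winner.name)
```

With these placeholder inputs, different tiers win on the two metrics, which is the same kind of split the table shows between DRAM and SCM. Density and power consumption could be added as further fields and metrics in the same way.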
Most existing databases and data storage systems have been designed around the performance characteristics and storage densities of hard disk drives (HDDs). Thus, they tend to avoid random access at all costs. To avoid long latencies, they parallelize their random access workloads either by spreading data across multiple drives in a disk array or by fetching all the data into expensive server-side DRAM, while running sequential access workloads on data stored on hard disk drives. This introduces a lot of complexity, both to keep the data consistent and available across workloads and to deal with disk failures. Moreover, because hard disk drives have mechanical parts, they are much more prone to failure than solid-state devices such as Flash, SCM, and DRAM.
To fully implement the butterfly architecture, one needs to cost-efficiently utilize the various classes of solid-state memory. In the next blog post in this series, we introduce Ampool Active Data Store, a novel memory-centric storage technology for implementing the butterfly architecture.