In the past few weeks, we have blogged about Ampool’s view on how data processing needs have evolved over the past years, and how Ampool is on a mission to simplify the lives of application developers and data workers with a common data store.

The wait is over! Today, we're making Ampool's Active Data Store (ADS) generally available for developers to try out. Please download it from our website. It is free to use as a single instance, or in pseudo-distributed mode, on your laptop, server, or a cloud instance.

As described in the previous posts, Ampool ADS is designed to enable real-time, data-driven applications with faster access to active data, resulting in timely insights. It is built on Apache Geode, a proven in-memory object store, and is optimized for demanding analytical workloads.

The core binaries include all the dependencies needed to run the core ADS services. Once these services are started, you can configure different connectors to store and analyze your data. Here are a few options to get started:

  • Working with a Spark shell (in standalone mode) to ingest, process and query your active data
  • Interacting with new or existing Hive tables through Beeline

All of this is possible without changing your application logic; only a few data access commands change.
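
To give a flavor of the Spark path, here is a minimal sketch in Java. The data source name ("io.ampool") and the connection options are assumptions made for illustration, not the documented connector API; please refer to the docs for the exact format name and options.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class AmpoolQuickstart {
    public static void main(String[] args) {
        // Local Spark session, matching the standalone/pseudo-distributed setup.
        SparkSession spark = SparkSession.builder()
                .appName("ampool-quickstart")
                .master("local[*]")
                .getOrCreate();

        // Ingest: read a CSV file and write it out through the connector.
        // NOTE: the data source name ("io.ampool") and the options below are
        // illustrative assumptions; see the Ampool docs for the exact values.
        Dataset<Row> events = spark.read()
                .option("header", "true")
                .csv("/tmp/events.csv");

        events.write()
                .format("io.ampool")                        // hypothetical connector name
                .option("ampool.locator.host", "localhost") // hypothetical option
                .option("ampool.locator.port", "10334")     // hypothetical option
                .mode(SaveMode.Overwrite)
                .save("events");

        // Query: read the table back and run a Spark SQL aggregate on it.
        Dataset<Row> active = spark.read()
                .format("io.ampool")
                .option("ampool.locator.host", "localhost")
                .option("ampool.locator.port", "10334")
                .load("events");

        active.createOrReplaceTempView("events");
        spark.sql("SELECT COUNT(*) FROM events").show();

        spark.stop();
    }
}
```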

For more information on getting started with Ampool ADS and the Spark and Hive connectors, please check out the docs here. If you have questions, please join and post them in the Ampool-Users Google Group here. We welcome your feedback!

Happy holidays!

[Acknowledgments: This blog post is the result of brainstorming with co-author Rohit Jain, CTO of our technology partner, Esgyn Corporation.]

This is the last post in the “Emerging Data Architectures” series outlining our vision of advancing the state of the art in data infrastructure.

In proprietary data management systems, there was no distinction between the query and storage engines, since both parts were provided by the same vendor and the proprietary implementation was tightly integrated. The exception has been MySQL, which allows its query engine to integrate with multiple storage engines, although that interface is quite limited. The NoSQL wave then introduced a multitude of storage engines to support the key-value, Big Table, document, text search, and graph data models. Depending on the workload, one storage engine can work better than another. Supporting a wide variety of operational and analytical workloads requires multiple data models, so that semi-structured and unstructured data can be processed alongside structured data in a relational model. This necessitates a separation between the query and storage engines. The division of responsibilities between these engines requires some flexibility, but can be roughly outlined as shown below:

Query Engine:

  • Allow clients to connect & submit queries
  • Distribute connections across the cluster for load balancing
  • Compile query
  • Execute query
  • Return results of query to client

Storage Engine:

  • Storage structure
  • Partitioning & automatic data repartitioning for load balancing
  • Select columns (projection pushdown)
  • Select rows based on predicates (filter pushdown)
  • Caching writes and reads
  • Clustering by key
  • Fast access paths or filtering
  • Transactional support
  • Replication for high availability
  • Compression & encryption
  • Mixed workload support
  • Bulk data ingest/extract
  • Indexing
  • Colocation or node locality
  • Data governance
  • Security
  • Disaster recovery
  • Backup, archive, restore
  • Multi-temperature data support

The challenge for a query engine in supporting multiple storage engines is that each storage engine has very different APIs and capabilities. A lot of effort therefore has to be spent exploiting all the capabilities a storage engine provides, and compensating for the capabilities it does not. Doing this integration well is not a trivial task.

To draw a parallel with query and storage engine integration, consider application-to-database interfaces. The ODBC standard was introduced in 1992, followed by the JDBC standard in 1997. These standards eased integration between client tools and the various databases prevalent in the marketplace, and their use has proliferated ever since. The idea was that, with a standard API to access databases, client tools using SQL could easily integrate with any database supporting that API. Of course, many client tools still had to support database-specific APIs, if only because those existed before the ODBC standard was introduced and were often more efficient than a vendor's ODBC implementation, which did not always map cleanly to the vendor's proprietary API. Nevertheless, it was a huge step forward to have a standardized interface that client tools could use to connect to any database that supported it.
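
As a reminder of what that standardization bought, the snippet below uses only the standard JDBC interfaces; the connection URL is a generic placeholder, and swapping databases means swapping the driver and URL, not the application code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcExample {
    public static void main(String[] args) throws Exception {
        // Only the URL (and credentials) are vendor-specific; the API is not.
        String url = "jdbc:somedb://dbhost:1234/sales";   // placeholder URL

        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT customer_id, amount FROM orders WHERE amount > ?")) {

            stmt.setBigDecimal(1, new java.math.BigDecimal("100.00"));
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("customer_id") + " " + rs.getBigDecimal("amount"));
                }
            }
        }
    }
}
```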

At this juncture, we are at a similar stage with the interface between the query and storage engines. While there are some similarities in the API requirements between these two scenarios, the integration between query and storage engines arguably goes far deeper. If all storage engines used the same interface (API), and exposed through it some indication of what they do and do not support, it would be far easier for these storage engines to leverage a large variety of query engines. Conversely, query engines could integrate with multiple storage engines far more easily, in order to support a large set of data models suited to specific use cases. Query engines could then federate across these storage engines at a much deeper level than is possible with a query federator, such as Composite, today.

Towards Common Storage Engine APIs

A common storage engine API could have made it much easier to integrate query engines such as EsgynDB, Hive, and Spark with a range of storage engines: key-value stores such as HBase, for low-latency workloads and semi-structured data stored using the Big Table data model; ORC files, for columnar storage serving BI and analytic workloads that scan large amounts of data; or Ampool, for in-memory storage serving real-time and streaming analytics, operational dashboards, temporary or dimensional tables that speed up processing, and materialized views or aggregated tables for very fast BI reports and analytical queries. Conversely, if EsgynDB, Hive, and Spark integrated with multiple storage engines through a standard storage engine interface, Ampool could easily connect to any of these query engines, and to a myriad of others such as Drill, Impala, and Presto, using the same API. This would give users the flexibility to choose the query engine of their choice while still leveraging Ampool underneath. Data stored in Ampool could then be accessed from any query engine far more easily than by implementing point integrations with each engine. And just as with ODBC/JDBC, a query engine could still provide Ampool-specific support in order to leverage its unique capabilities.

There are many areas where a query engine has to integrate with a storage engine. Some of the most important ones are:

  • Statistics
  • Key structure
  • Partitioning
  • Data type support
  • Projection and selection
  • Extensibility
  • Security enforcement
  • Transaction management
  • Metadata support
  • Performance, scale, and concurrency considerations
  • Error handling
  • Other operational aspects

These are discussed in more detail in an O'Reilly Data Report, "In Search of Database Nirvana — The Challenges of Delivering Hybrid Transaction/Analytical Processing," by Rohit Jain, CTO, Esgyn Corp.

A proposal for such an API could be put together by a team of open source community members who provide query and storage engines, using the JDBC API as a model. The API would provide the interfaces needed to support the integration points above, along with co-partitioned or re-partitioned parallel data exchange, and streaming. At Ampool, we will be working with the open source community to make this a reality.
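
To make the idea slightly more concrete, here is one possible shape such an API could take, sketched in Java by analogy with JDBC. None of these interfaces exist today; the names, methods, and capability flags are purely illustrative, and the real proposal would have to emerge from the community effort described above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/**
 * Illustrative sketch of a common storage engine API, modeled loosely on JDBC.
 * Nothing here is an actual product API; all names and shapes are assumptions.
 */
public interface StorageEngine {

    /** Capability discovery, so a query engine knows what it must compensate for. */
    Capabilities capabilities();

    /** Metadata support: columns, key structure, partitioning, statistics. */
    TableDescriptor describeTable(String name);

    /** Projection and filter pushdown for a (possibly parallel) scan. */
    Scanner openScanner(ScanSpec spec);

    /** Bulk ingest/extract and transactional writes, if supported. */
    Writer openWriter(String table, Optional<TransactionHandle> txn);

    interface Capabilities {
        boolean supportsFilterPushdown();
        boolean supportsAggregatePushdown();
        boolean supportsTransactions();
        boolean supportsSecondaryIndexes();
    }

    interface TableDescriptor {
        List<ColumnDescriptor> columns();
        List<String> partitioningKey();
        Map<String, Object> statistics();   // row counts, histograms, etc.
    }

    interface ColumnDescriptor {
        String name();
        String type();
    }

    /** What the query engine pushes down: columns, predicates, and the partitions to scan. */
    interface ScanSpec {
        String table();
        List<String> projectedColumns();
        List<String> predicates();          // engine-neutral predicate representation
        List<Integer> partitions();         // for co-partitioned parallel data exchange
    }

    interface Scanner extends AutoCloseable {
        Optional<RowBatch> nextBatch();
    }

    interface RowBatch {
        int rowCount();
        Object column(int index);
    }

    interface Writer extends AutoCloseable {
        void append(RowBatch batch);
        void flush();
    }

    interface TransactionHandle { }
}
```

In such a scheme, a query engine would first call capabilities() to learn what it must compensate for, then push projection, predicates, and partition assignments down through the scan specification, keeping joins, aggregation, and anything the storage engine does not support on its own side.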

In the previous posts in this series, we explained why a new storage subsystem is needed to effectively implement the butterfly architecture. Ampool has built exactly such an “Active Data Store”.

Ampool Active Data Store (ADS) is a memory-centric, distributed, data-aware object store optimized for unifying streaming, transactional, and analytical workloads. We describe each of these characteristics of Ampool ADS next.

  • Memory-Centric: Although DRAM costs have declined rapidly over the years, they are still very high compared to other storage media such as SSDs and hard disk drives. Fortunately, not all the enterprise data that needs to be analyzed requires DRAM-level performance. Moreover, as data ages it is accessed less and less frequently. Cold data can therefore be stored on hard disk drives, warm data on SSDs, and only the data that is accessed most frequently, and needs the fastest access, kept in DRAM. Manually moving data across these storage tiers is cumbersome and error-prone. Ampool ADS implements smart tiering that monitors data usage and automatically moves data to lower tiers as it is accessed less frequently.
  • Distributed: Although DRAM and storage density have increased dramatically over the years, constantly adding more DRAM and storage to a single system (i.e. scaling up) does not scale overall system performance in proportion to the cost. Thus, even memory-centric storage systems need to be clustered and distributed for linear scalability, fault tolerance, and disaster recovery. Ampool ADS is designed as a distributed system from the ground up. Data is replicated across the address spaces of multiple machines in the cluster in order to be highly available. In addition, changes to the data are propagated via a scalable message bus across a wide area network for disaster recovery.
  • Object Store: Historically, the most common types of storage systems were block stores and file systems, each with its own advantages. A block store can be shared across different operating systems and has a much lower overhead for accessing a random piece of data. However, network round-trips to fetch individual blocks are often inefficient for today's large-scale data workloads. In addition, since the basic unit of read and write is a 4KB block, small updates and small reads result in a lot of unnecessary data traffic over the network or on the local disks. File systems are the most commonly used storage abstraction and are available in various flavors across operating systems; several scalable distributed file systems are also available from multiple vendors. However, implementing file system semantics, which involves maintaining and navigating a hierarchical namespace, preserving consistency and atomicity across file system operations, and supporting random in-place reads and writes within files, imposes a lot of overhead on file system servers as well as clients. Typically, a file system read or write has 50-100 microseconds of latency. When the file system was implemented on top of slow rotating disks, which had around 10 milliseconds of latency of their own, the file system latency was negligible compared to that of the underlying storage media. With new fast random-access storage such as SSDs and NVRAM, which have only a few microseconds of latency, the file system abstraction carries overwhelmingly high overhead. In the last decade, with the emergence of public clouds and their hosted storage services, a third kind of storage abstraction, the object store, has become popular. Object stores organize data without a deep hierarchy: to access an object, one only needs a bucket ID and an object ID, rather than navigating a hierarchical namespace. In addition, object stores associate rich metadata with each object and bucket, so operations such as filtering and content search can be pushed down to the storage layer, reducing network bandwidth requirements and CPU load. Object stores are ideal for the new classes of storage media because of their low CPU overhead, simpler semantics, and scalability, especially with large amounts of data stored as objects.
  • Data-Aware: Most existing object stores do not interpret the contents of the objects they hold, which limits their utility. Indeed, the most common use of object stores is as BLOB (Binary Large Object) stores for storing and retrieving multimedia, such as images or video. Implementing an analytical workload on data held in such an object store would require fetching each entire object (which may be megabytes or gigabytes in size) to the CPU, imposing a schema on it, deserializing it, and then performing the necessary analytical computations. Ampool ADS instead stores extensive metadata about objects, such as schema, versions, partitioning key, and various statistics about the objects' contents. This enables common operations such as projection, filtering, and aggregation to be pushed down to the object store (see the sketch after this list), which speeds up most analytical computations and avoids the network bottlenecks prevalent in other distributed storage systems.
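
To illustrate what data awareness buys in practice, here is a minimal sketch of a scan with projection and filter pushdown. Since ADS is built on Apache Geode, the sketch uses Geode's client query API to convey the idea; Ampool's own table API differs, and the region name and field names below are made up for illustration.

```java
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.query.Query;
import org.apache.geode.cache.query.QueryService;
import org.apache.geode.cache.query.SelectResults;
import org.apache.geode.cache.query.Struct;

public class PushdownExample {
    public static void main(String[] args) throws Exception {
        // Connect to a locally running locator (host/port are placeholders).
        ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("localhost", 10334)
                .create();

        // Both the projection (two fields) and the filter are evaluated on the
        // servers, so only the qualifying fields of matching entries cross the
        // network. The /webEvents region and its fields are made up.
        QueryService queryService = cache.getQueryService();
        Query query = queryService.newQuery(
                "SELECT e.userId, e.latencyMs FROM /webEvents e WHERE e.latencyMs > 500");

        SelectResults<Struct> results = (SelectResults<Struct>) query.execute();
        for (Struct row : results) {
            System.out.println(row.get("userId") + " " + row.get("latencyMs"));
        }

        cache.close();
    }
}
```

Because the projection and the predicate are evaluated where the data lives, the client never deserializes whole objects just to throw most of their contents away.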

Ampool Active Data Store, with Integrations with Compute Frameworks, and Other Storage Engines

In addition to the core memory-centric object store, the Ampool product includes several optimized connectors that allow existing computational engines to efficiently store and retrieve data from the Ampool object store. The number of connectors grows with every version of Ampool; the connectors currently provided out of the box include Apache Spark, Apache Hive, Apache Trafodion (in collaboration with Esgyn, Inc.), Apache Apex (in collaboration with DataTorrent, Inc.), and CDAP (the Cask Data Application Platform, in collaboration with Cask Data, Inc.). Although Ampool is itself a fully distributed storage system capable of maintaining large volumes of operational, persistent data, it also provides several connectors for loading data from, and storing data to, persistent stores, including the Hadoop Distributed File System (HDFS), Apache Hive (ORC), and Apache HBase. Ampool can be deployed as a separate system alongside Hadoop components, or into an existing, running Hadoop cluster, using either Apache Ambari or Cloudera Manager. It can be monitored and managed with the provided tools, or by connecting the JMX metrics produced by Ampool to any JMX-compatible monitoring system. By providing fast analytical storage for both immutable and mutable dataframes and datasets, along with extensions to support event streams, Ampool supplies the missing piece for implementing the butterfly architecture and allows unification of various transactional and analytical workloads.
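
For monitoring, the generic JMX plumbing looks like the sketch below. The JMX service URL, the MBean object-name pattern, and the attribute name are assumptions for illustration; substitute whatever your ADS deployment actually exposes.

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxMetricsExample {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port for the cluster's JMX manager.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();

            // The domain/pattern and attribute below are assumptions; use the
            // MBean names actually exposed by your ADS/Geode deployment.
            Set<ObjectName> members = mbeans.queryNames(
                    new ObjectName("GemFire:type=Member,*"), null);

            for (ObjectName member : members) {
                Object usedMemory = mbeans.getAttribute(member, "UsedMemory");
                System.out.println(member + " UsedMemory=" + usedMemory);
            }
        } finally {
            connector.close();
        }
    }
}
```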

In the next post in this series, we will look at the clean separation between computation frameworks and storage systems required for unified data processing under the butterfly architecture, and at the roles and responsibilities of each, to describe how Ampool ADS will evolve.

In order to efficiently implement the butterfly architecture, one needs a fast storage engine for data exchange across data pipelines, streaming ingestion, and analytics. Optimized implementations of both immutable and mutable dataframes are needed to support fast batch-oriented queries as well as fast transactions, allowing multiple workloads to co-exist on a single system.

Traditional disk-based storage systems make this unification extremely difficult. However, the emergence of NVMe-connected Flash, NVDIMMs (non-volatile DIMMs), and a new class of persistent memory (SCM, or Storage Class Memory) provides a perfect storage medium in which high-throughput, scan-oriented workloads can co-exist with low-latency, random-access workloads. The table below characterizes the current and projected performance of various storage layers, along with their approximate cost.

Price/Performance Comparison of Storage Technologies

As we see from the table above, the current generation of DDR4 DRAM is the most cost-efficient for throughput-oriented workloads (bandwidth per dollar), and the emerging Storage Class Memory (SCM) is the most cost-effective for random-access workloads (I/O operations per second per dollar). Of course, media cost is not the only consideration when building systems; storage density and power consumption also need to be considered. Since the new SCM promises much higher density and much lower power consumption than DRAM, it has the potential to become the primary storage layer for a fast, unified data platform.

Most existing databases and data storage systems were designed around the performance characteristics and storage densities of HDDs, so they tend to avoid random access at all costs. To avoid long latencies, they parallelize random-access workloads either by spreading data across multiple hard disk drives in a disk array or by fetching all the data into expensive server-side DRAM, while running sequential-access workloads against data stored on the hard disk drives. This introduces a lot of complexity to keep data consistent and available across workloads and to deal with disk failures. Moreover, hard disk drives, having mechanical parts, are much more prone to failure than solid-state devices such as Flash, SCM, and DRAM.

To fully implement the butterfly architecture, one needs to cost-efficiently utilize the various classes of solid state memory. In the next blog post in this series, we introduce Ampool Active Data Store, a novel memory-centric storage technology for implementing the butterfly architecture.