This is the last post in the “Emerging Data Architectures” series outlining our vision of advancing the state of the art in data infrastructure.
In proprietary data management systems there was no distinction between the query engine and the storage engine: both were provided by the same vendor, and the proprietary implementation was tightly integrated. The exception has been MySQL, whose query engine can integrate with multiple storage engines, though that interface has been very limited. The NoSQL wave introduced a multitude of storage engines to support the key-value, Big Table, document, text search, and graph data models. Depending on the workload, one storage engine can work better in certain scenarios than others. Supporting a wide variety of operational and analytical workloads requires multiple data models, to process semi-structured and unstructured data along with structured data in a relational model. This necessitates a separation between the query and storage engines. The division of responsibilities between these engines may require some flexibility, but could be roughly divided up as shown below:
Query engine:
- Allow clients to connect & submit queries
- Distribute connections across the cluster for load balancing
- Compile queries
- Execute queries
- Return query results to the client

Storage engine:
- Storage structure
- Partitioning & automatic data repartitioning for load balancing
- Select columns (projection pushdown)
- Select rows based on predicates (filter pushdown)
- Caching writes and reads
- Clustering by key
- Fast access paths or filtering
- Transactional support
- Replication for high availability
- Compression & encryption
- Mixed workload support
- Bulk data ingest/extract
- Colocation or node locality
- Data governance
- Disaster recovery
- Backup, archive, restore
- Multi-temperature data support
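As a concrete illustration, the division of responsibilities above can be sketched as two cooperating components, with projection and filter pushdown handled on the storage side. This is a minimal sketch; all class and method names are invented for illustration, and no real engine exposes exactly this API.

```python
from typing import Iterator

Row = dict  # a row as a column-name -> value mapping

class StorageEngine:
    """Storage side: owns the data layout and answers scans with pushed-down work."""
    def __init__(self, rows):
        self._rows = rows

    def scan(self, columns, predicate) -> Iterator[Row]:
        for row in self._rows:
            # Filter pushdown: rows are discarded inside the storage layer.
            if predicate(row):
                # Projection pushdown: only requested columns leave storage.
                yield {c: row[c] for c in columns}

class QueryEngine:
    """Query side: accepts a request, drives the scan, and shapes the result set."""
    def __init__(self, storage):
        self._storage = storage

    def execute(self, columns, predicate):
        return list(self._storage.scan(columns, predicate))

rows = [{"id": 1, "city": "SF", "qty": 5},
        {"id": 2, "city": "NY", "qty": 9}]
engine = QueryEngine(StorageEngine(rows))
result = engine.execute(["id", "city"], lambda r: r["qty"] > 6)
```

The point of the split is that the query engine never sees unfiltered, unprojected data; how the storage engine satisfies the scan (row store, column store, in-memory) stays behind the interface.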
The challenge for a query engine in supporting multiple storage engines is that each storage engine has very different APIs and capabilities. Considerable effort must be spent exploiting the capabilities a storage engine does provide and compensating for the ones it does not. Doing this integration well is not a trivial task.
To draw a parallel with query and storage engine integration, consider application-to-database interfaces. The ODBC standard was introduced in 1992, followed by the JDBC standard in 1997. These standards made it far easier for client tools to integrate with the various databases prevalent in the marketplace, and such integrations have proliferated ever since. The idea was that with a standard API to access databases, any client tool using SQL could integrate with any database supporting that API. Of course, many client tools still had to support database-specific APIs, partly because those APIs predated the ODBC standard, and partly because they were more efficient than the vendor’s ODBC implementation, which often did not map cleanly onto the vendor’s proprietary API. Nevertheless, it was a huge step forward to have a standardized interface that client tools could use to connect to any database that supported it.
We are now at a similar juncture with the interface between query and storage engines. While there are some similarities in the API requirements between the two scenarios, the integration between query and storage engines arguably goes far deeper. If all storage engines used the same interface (API), and through that API conveyed what they did and did not support, it would be far easier for storage engines to be leveraged by a large variety of query engines. Conversely, query engines could integrate with multiple storage engines far more easily, supporting a large set of data models suited to specific use cases. Query engines could then federate across these storage engines at a much deeper level than is possible with a query federator, such as Composite, today.
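One way such an API could convey what an engine supports is a simple capability query, with the query engine compensating for anything the storage engine lacks. A minimal sketch, with hypothetical names, in which a store offers projection pushdown but not filter pushdown:

```python
class SimpleStore:
    """A store that supports projection pushdown but not filter pushdown."""
    CAPABILITIES = {"projection_pushdown"}

    def __init__(self, rows):
        self._rows = rows

    def supports(self, capability):
        # Capability discovery: the query engine asks before pushing work down.
        return capability in self.CAPABILITIES

    def scan(self, columns, predicate=None):
        if predicate is not None and not self.supports("filter_pushdown"):
            raise NotImplementedError("filter pushdown not supported")
        for row in self._rows:
            yield {c: row[c] for c in columns}

def execute(store, columns, predicate):
    # Exploit pushdown when offered; otherwise compensate by filtering in
    # the query engine after the scan. (The sketch assumes the predicate
    # only references projected columns.)
    if store.supports("filter_pushdown"):
        return list(store.scan(columns, predicate))
    return [r for r in store.scan(columns) if predicate(r)]

store = SimpleStore([{"id": 1, "qty": 2}, {"id": 2, "qty": 5}])
filtered = execute(store, ["id", "qty"], lambda r: r["qty"] > 3)
```

The same `execute` works unchanged against a smarter store that does advertise `filter_pushdown`; only where the filtering happens moves.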
Towards a Common Storage Engine API
A common storage engine API could have made it far easier for query engines such as EsgynDB, Hive, and Spark to integrate with a range of storage engines: key-value stores such as HBase, to support low-latency workloads and semi-structured data stored using the Big Table data model; ORC files, for columnar storage serving BI and analytic workloads that scan large amounts of data; or Ampool, for in-memory storage serving real-time and streaming analytics, operational dashboards, temporary or dimensional tables that speed up processing, and materialized views or aggregated tables for very fast processing of BI reports and analytical queries. Conversely, if EsgynDB, Hive, and Spark integrated with storage engines through a standard interface, Ampool could connect to any of them using the same API, along with a myriad of others such as Drill, Impala, and Presto. This would give users the flexibility to choose their preferred query engine while still leveraging Ampool underneath. Data stored in Ampool could then be accessed via any query engine far more easily than by implementing point integrations with each engine. And just as with ODBC/JDBC, a query engine could still provide Ampool-specific support to leverage its unique capabilities.
There are many areas where a query engine has to integrate with a storage engine. These are some of the most important:
- Key structure
- Data type support
- Projection and selection
- Security enforcement
- Transaction management
- Metadata support
- Performance, scale, and concurrency considerations
- Error handling
- Other operational aspects
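To make these integration points more concrete, here is a sketch of how some of them might surface as methods on a common storage engine interface, loosely in the spirit of JDBC's `DatabaseMetaData`. Every name here is hypothetical; a real proposal would cover all the areas above, including security, error handling, and operational concerns.

```python
from abc import ABC, abstractmethod

class StorageEngineAPI(ABC):
    @abstractmethod
    def key_columns(self, table): ...               # key structure & clustering

    @abstractmethod
    def native_types(self): ...                     # data type support

    @abstractmethod
    def scan(self, table, columns, predicate): ...  # projection and selection

    @abstractmethod
    def begin(self): ...                            # transaction management
    @abstractmethod
    def commit(self, txn): ...

    @abstractmethod
    def tables(self): ...                           # metadata support

class InMemoryEngine(StorageEngineAPI):
    """A toy implementation, just enough to exercise the interface."""
    def __init__(self):
        self._data = {"orders": [{"id": 1, "total": 40.0}]}
    def key_columns(self, table): return ["id"]
    def native_types(self): return {"INT", "DOUBLE", "VARCHAR"}
    def scan(self, table, columns, predicate):
        return [{c: r[c] for c in columns}
                for r in self._data[table] if predicate(r)]
    def begin(self): return {"txn": 1}
    def commit(self, txn): return True
    def tables(self): return list(self._data)
```

A query engine written against `StorageEngineAPI` can plan against `key_columns` and `native_types` without knowing which concrete engine sits underneath.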
These are discussed in more detail in an O’Reilly Data Report, “In Search of Database Nirvana — The Challenges of Delivering Hybrid Transaction / Analytical Processing” by Rohit Jain, CTO, Esgyn Corp.
A proposal for such an API could be put together by a team of open source community members who provide query and storage engines, leveraging the JDBC API as a model. The API would provide the interfaces needed to support the integration points mentioned above, along with co-partitioned or re-partitioned parallel data exchange, and streaming. At Ampool, we will be working with the open source community to make this a reality.
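To illustrate what co-partitioned parallel data exchange could look like through such an API, here is a sketch in which a storage engine advertises its partitioning and node locality, so a query engine can join two co-partitioned tables partition by partition without repartitioning. All names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Split:
    partition_id: int
    hosts: tuple  # preferred nodes, for locality-aware scheduling

class PartitionedStore:
    def __init__(self, partitions, hosts):
        self._partitions = partitions  # partition id -> list of rows
        self._hosts = hosts

    def splits(self):
        # Advertise one split per partition, with locality hints, so the
        # query engine can schedule its scans next to the data.
        return [Split(pid, (self._hosts[pid % len(self._hosts)],))
                for pid in sorted(self._partitions)]

    def scan(self, split):
        yield from self._partitions[split.partition_id]

def copartitioned_join(left, right, key):
    # When both stores expose the same split layout, the join proceeds
    # partition by partition with no cross-node data exchange.
    out = []
    for ls, rs in zip(left.splits(), right.splits()):
        by_key = {r[key]: r for r in right.scan(rs)}
        for r in left.scan(ls):
            if r[key] in by_key:
                out.append({**r, **by_key[r[key]]})
    return out

left = PartitionedStore({0: [{"k": 1, "a": 10}], 1: [{"k": 2, "a": 20}]},
                        ["node1", "node2"])
right = PartitionedStore({0: [{"k": 1, "b": "x"}], 1: [{"k": 3, "b": "y"}]},
                         ["node1", "node2"])
joined = copartitioned_join(left, right, "k")
```

When the layouts do not line up, the query engine would fall back to re-partitioned exchange, shuffling one side to match the other before joining.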