The main intention of the butterfly architecture, described in the previous post, is to unify various data processing tasks on a single platform. In order to implement it, we need to treat the data under management with new, more general abstractions that differ from current abstractions such as files, directories, tables, and indexes.

In the butterfly architecture, we organize data as linked collections of three types of abstractions:

Datasets: This is the most flexible abstraction: a partitioned collection of arbitrary records. Except for having a partitioning key, very little structure is imposed on the records. In other words, interpreting what is in the records is left to the processing framework, with the aid of a system catalog. This is equivalent to schema-on-read data, which is the only kind of data managed by current Hadoop/NoSQL data systems. The system catalog stores information about each dataset, as well as relationships among multiple datasets. Each dataset is given a unique identifier, and the catalog is a logical set of RDF-style triples, denoted by (Relation, Object1, Object2). For example, to indicate that a dataset with ID D4596 is named SearchLog, the catalog has an entry (NameOf, “SearchLog”, D4596). As another example, to indicate that dataset D4596 is located on HDFS, the catalog has an entry (Location, D4596, “hdfs://namenode:port/user/data/something”). Note that this is a logical representation of the system metadata about datasets; it may be represented physically as a set of fixed-width tables for efficiency. Datasets may be stored on multiple storage systems, and even the partitions of a single dataset may be spread across multiple storage back-ends. In addition, when a dataset is stored as a stream of bytes in files or transferred across a network, the serialization and deserialization format is user-defined or operator-defined.
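To make the catalog representation concrete, here is a minimal sketch in Python of a triple-based catalog holding the two example entries from above. The class and method names are illustrative only; a real implementation would likely back this logical view with fixed-width tables, as noted.

```python
# Minimal sketch of the logical system catalog as (Relation, Object1, Object2) triples.
# Names here are hypothetical; the physical layout could be very different.

class Catalog:
    def __init__(self):
        self.triples = []

    def add(self, relation, obj1, obj2):
        self.triples.append((relation, obj1, obj2))

    def match(self, relation=None, obj1=None, obj2=None):
        """Return all triples matching the given pattern (None acts as a wildcard)."""
        return [t for t in self.triples
                if (relation is None or t[0] == relation)
                and (obj1 is None or t[1] == obj1)
                and (obj2 is None or t[2] == obj2)]

catalog = Catalog()
catalog.add("NameOf", "SearchLog", "D4596")
catalog.add("Location", "D4596", "hdfs://namenode:port/user/data/something")

# Where does dataset D4596 live?
print(catalog.match(relation="Location", obj1="D4596"))
# Which dataset is named SearchLog?
print(catalog.match(relation="NameOf", obj1="SearchLog"))
```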

Dataframes: Dataframes are structured datasets. They are partitioned with a user-specified partitioning key contained in the individual records. Dataframes may be mutable or immutable: immutable dataframes may not be modified in any way once they are created, while individual records of a mutable dataframe may be inserted, updated, or deleted. Dataframes are typically created by computation frameworks that pipeline processing stages, with datasets as the initial inputs of these data processing pipelines. Dataframes are very similar to structured tables in relational database management systems, with a predefined schema. However, for modern workloads, dataframes must support richer data types, such as lists, maps, and structs, with the ability to traverse these complex nested types. Immutable dataframes are suitable for analytical workloads, whereas mutable dataframes are used for transactional CRUD workloads.
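As an illustration, the sketch below (plain Python with hypothetical field names, not any particular engine's API) shows the shape of a dataframe record that carries a partitioning key along with nested list, map, and struct fields, and how such nested types can be traversed.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical nested record types for a "user activity" dataframe.
@dataclass(frozen=True)          # frozen models records of an immutable dataframe
class Device:                    # a struct-typed field
    os: str
    model: str

@dataclass(frozen=True)
class ActivityRecord:
    user_id: str                 # user-specified partitioning key
    page_views: List[str]        # list-typed field
    attributes: Dict[str, str]   # map-typed field
    device: Device               # struct-typed field

record = ActivityRecord(
    user_id="u-42",
    page_views=["/home", "/search"],
    attributes={"plan": "free"},
    device=Device(os="android", model="pixel"),
)

# Traversing the nested types:
print(record.user_id, record.device.os, record.attributes["plan"])
```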

Event Streams: Event streams are unbounded dataframes. In this type of dataframe, at least one of the fields in the records (events) is mostly monotonically increasing; usually this field is a timestamp or a sequence number. Optionally, streams may have a window size specified either as a number of records (when the monotonically increasing field is a sequence number) or as a time duration (when it is a timestamp). Within a window, there may be some out-of-order arrival of events; across windows, however, the sequence number or timestamp is strictly monotonically increasing.
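A minimal sketch of time-based windowing over such a stream is shown below (plain Python, with an assumed tumbling-window duration and hypothetical event fields): events are grouped into fixed-duration windows by their timestamp, so some out-of-order arrival within a window is tolerated, while the window boundaries themselves advance monotonically.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window duration

def assign_windows(events, window_seconds=WINDOW_SECONDS):
    """Group events into tumbling windows by their (mostly increasing) timestamp."""
    windows = defaultdict(list)
    for event in events:
        window_start = (event["ts"] // window_seconds) * window_seconds
        windows[window_start].append(event)
    return dict(windows)

# Slightly out-of-order arrival within a window is fine; windows still close in order.
events = [
    {"ts": 1000, "value": "a"},
    {"ts": 1005, "value": "b"},
    {"ts": 1003, "value": "c"},   # out of order, but within the same window
    {"ts": 1062, "value": "d"},   # next window
]

for start, batch in sorted(assign_windows(events).items()):
    print(start, [e["value"] for e in batch])
```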

In order to store and access these data abstractions efficiently, the core of the butterfly architecture is the storage subsystem, which provides the desired characteristics of a flexible metadata store, efficient concurrent querying, and flexible transformations among these data abstractions. In the next post in this blog series, we will discuss the technology trends that are making this core storage subsystem possible.

Although Apache Hadoop and its associated projects have rapidly evolved to promise a unified analytics platform, the core component of Hadoop, the Hadoop Distributed File System (HDFS), was never designed with interactive, streaming, and transactional workloads in mind, and it has become a hindrance to achieving that goal.

As a result, multiple architectures, such as the Lambda architecture and the Kappa architecture, were proposed to unify multiple data processing workloads. Unfortunately, these have been quite difficult to implement.

In the Lambda architecture, the speed, serving, and batch layers need three different implementations of data processing frameworks for the same functionality, depending on the speed and scale needed. It is the responsibility of the implementer to transport the needed data across these three layers, which is a non-trivial task. In addition, one cannot guarantee data consistency across the three layers without introducing yet another distributed transaction engine between the data applications and the Lambda architecture.

Lambda Architecture
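To make the duplication concrete, here is a toy sketch (plain Python, not any specific framework) of the same per-key count implemented once as a batch job over a complete dataset and once as an incremental streaming update. In a Lambda deployment, both versions must be written, deployed, and kept logically equivalent by the implementer.

```python
# Batch layer: recompute counts over the complete historical dataset.
def batch_counts(all_events):
    counts = {}
    for event in all_events:
        counts[event["key"]] = counts.get(event["key"], 0) + 1
    return counts

# Speed layer: the same logic, re-implemented as an incremental update
# applied to each arriving event. Keeping the two in sync is the hard part.
def update_counts(counts, event):
    counts[event["key"]] = counts.get(event["key"], 0) + 1
    return counts

events = [{"key": "search"}, {"key": "click"}, {"key": "search"}]
print(batch_counts(events))               # batch view

realtime = {}
for e in events:
    realtime = update_counts(realtime, e) # speed view
print(realtime)                           # must agree with the batch view
```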

The Kappa architecture, introduced by the creators of Apache Kafka, relies on a distributed log as the primary source of data. A distributed log provides decoupling between data producers and data consumers. It allows different batch and streaming analytics engines to consume data from the same data bus, thus solving the data consistency problem introduced in the Lambda architecture. But the difficulty of implementing data processing in the speed, serving, and batch layers in three different engines remains.

 

Kappa Architecture
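The decoupling that a distributed log provides can be sketched as follows (plain Python, not the Kafka API): producers append records to the log, and each consumer tracks its own offset, so a streaming engine and a batch engine can independently read the same records from the same data bus.

```python
class Log:
    """Toy append-only log; a real distributed log (e.g. Kafka) adds partitioning,
    replication, and retention, but the consumption model is similar."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

log = Log()
for i in range(5):
    log.append({"event_id": i})

# Two independent consumers with their own offsets read the same data bus.
streaming_offset = 0
batch_offset = 0
print(log.read(streaming_offset, max_records=2))  # streaming engine reads a small batch
print(log.read(batch_offset, max_records=5))      # batch engine re-reads from the start
```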

We note that the primary difficulty in implementing the speed, serving, and batch layers in the same unified architecture is due to the deficiencies of the distributed file system in the Hadoop ecosystem. If a storage component could replace or augment the HDFS to serve the speed and serving layers, while keeping data consistent with HDFS for batch processing, it could truly provide a unified data processing platform. This observation leads to the butterfly architecture.

The main differentiating characteristic of the butterfly architecture is the flexibility in computational paradigms on top of each of the above data abstractions. Thus a multitude of computational engines, such as MPP SQL engines (Apache Impala, Apache Drill, or Apache HAWQ), MapReduce, Apache Spark, Apache Flink, Apache Hive, or Apache Tez, can process the various data abstractions: datasets, dataframes, and event streams. These computation steps can be strung together to form data pipelines, which are orchestrated by an external scheduler. A resource manager, associated with pluggable resource schedulers that are data-aware, is a must for implementing the butterfly architecture. Apache YARN and Apache Mesos, along with orchestration frameworks such as Kubernetes and hybrid resource management frameworks such as Apache Myriad, have emerged in the last few years to fulfill this role.
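As a rough illustration (plain Python, with hypothetical stage names rather than any particular engine or scheduler API), a pipeline in this architecture is just an ordered set of computation steps, each consuming one data abstraction and producing another, which an external scheduler runs in dependency order.

```python
# Hypothetical pipeline: raw dataset -> structured dataframe -> per-user summary.
def parse_dataset(raw_records):
    # e.g. a Spark or MapReduce job turning raw records into structured rows
    return [{"user": r.split(",")[0], "ts": int(r.split(",")[1])} for r in raw_records]

def aggregate_dataframe(rows):
    # e.g. an MPP SQL engine computing events per user
    counts = {}
    for row in rows:
        counts[row["user"]] = counts.get(row["user"], 0) + 1
    return counts

PIPELINE = [parse_dataset, aggregate_dataframe]   # stages in dependency order

def run_pipeline(stages, data):
    """Stand-in for an external scheduler executing stages in order."""
    for stage in stages:
        data = stage(data)
    return data

print(run_pipeline(PIPELINE, ["alice,1000", "bob,1003", "alice,1042"]))
```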

The butterfly architecture and its associated data flows are illustrated below.

 

Butterfly Architecture

The main intention of the butterfly architecture is to unify various data processing tasks on a single platform. In order to implement it, we need to treat data with new, more general abstractions that differ from current abstractions such as files, directories, tables, and indexes.

In the next blog post in this series, we will take a look at the minimum set of data abstractions that are needed to implement the butterfly architecture.

In order to hazard a guess about what the future of data architecture holds, let us take a brief tour of how we arrived at the current data architecture. Prior to the popularity of relational databases, multiple data models were in use in data systems. Among these, hierarchical and navigational data systems were used extensively on mainframe-based systems. Since the number of clients for these data systems was limited, they remained monolithic and, more often than not, were offered by the mainframe manufacturer and bundled with the hardware.

The relational model, proposed more than forty years ago, was deemed suitable for a majority of data applications and became very popular for prevalent use cases in banking, insurance, and the financial services industries. Relational database systems became the default backend data system, serving as the store of record for a variety of verticals. The advent of client-server systems, where multiple clients would utilize the data stored and served by the same server, gave importance to up-front data modeling, a standard query language (SQL), formal data manipulation semantics (ACID), query concurrency, rule-based and cost-based query optimization, standard access methods (ODBC and JDBC), and a plethora of visual tools for building database-backed applications, data visualizations, and so on.

Client access to these operational databases was a mix of CRUD (Create-Read-Update-Delete) primitives on either a single record, or a small number of records. In order to provide consistency across multiple CRUD operations, a notion of transactions was introduced, where either all the operations were carried out atomically, or none at all. These data systems were known as OLTP (On-Line Transactional Processing) systems, and their performance was measured in transactions-per-second.

Most business intelligence (BI) and reporting workloads used very different data access patterns. These queries were mostly read-only, over large amounts of historical data. Although the operational data systems were initially used to handle both transactional and analytical workloads, they could not meet the needs of low-latency transactions and high-throughput analytics simultaneously. Thus, to serve this new class of applications, data systems specialized in OLAP (On-Line Analytical Processing) were devised. Since these OLAP systems had to handle large amounts of historical data, they were often built as MPP (massively parallel processing) systems on a shared-nothing distributed architecture. This created two silos of structured data in organizations: one for structured transactions, and another for structured analytics. Even though both of these systems were designed with relational data models in mind, one would often need to integrate multiple transactional data stores across multiple departments to provide complete historical data for analysis. Thus, a notion of periodic ETL (extract-transform-load) was born, which would capture data changes across multiple transactional data stores, map their relationships, and structure them into fact and dimension tables with star or snowflake schemas. The analytical query engines, and the storage for analytical data, were quite different from their transactional counterparts. Analytical data, once stored, would almost never have to change, as it was a historical record of business transactions.

In the world of structured operational and analytical data stores, semi-structured data (such as server logs) and unstructured data (such as natural language communication in customer interactions) were either discarded or were kept in an archival store for compliance reasons. Centralized file systems became a popular choice of data stores for semi-structured and unstructured historical datasets, with specialized access layers, such as keyword search.

Apache Hadoop aimed to address analytics on semi-structured and unstructured data by providing a distributed file system (HDFS) on commodity hardware, coupled with a batch-oriented, flexible data processing paradigm called MapReduce. As the Hadoop ecosystem expanded, it was used to tackle a larger variety of data processing workloads. Thus came scripting languages such as Apache Pig, SQL-like query languages such as Apache Hive, and NoSQL stores such as Apache HBase, all of which used HDFS as their persistence store.

Eventually, compute resource management was separated from the batch-oriented programming model (as Apache Hadoop YARN), which allowed a proliferation of data processing frameworks to run on top of data stored in HDFS. These included traditional MPP data warehouses (such as Apache HAWQ and Apache Impala), streaming analytics systems (such as Apache Apex), and transactional SQL engines (such as Apache Trafodion). This gave rise to the notion of a Data Lake, where all the raw data from across the enterprise and from external data sources would be loaded and made available for flexible analytics, using best-of-breed data processing engines on the same granular data.

As the concept of Data Lake started becoming popular, a few data architectures were proposed to combine various analytical data processing workloads for building end-to-end data processing pipelines. In the next part of this blog series, we will give an overview of these modern data architectures.

We are witnessing tremendous transformations in enterprise data architectures. The traditional workhorse, relational database management system (RDBMS), is being supplemented by large-scale non-relational stores, such as Apache Hadoop Distributed File System (HDFS), MongoDB, Apache Cassandra, and Apache HBase. While this transformation may seem overwhelming to a few, we see a more fundamental shift on its way, which demands many more changes to modern data architectures.

The current ongoing transformation was mandated by business requirements for the connected world, with the volume, variety, and velocity of data that resulted from ubiquitous connectivity for consumers, democratization of content, and advertising-supported business models. We believe that the next wave will be dictated by richer customer interaction, right-time business insights and operational cost optimization. This is augmented by transformative changes in the underlying infrastructure technology such as on-demand availability of computation, storage & networking in public and private clouds, ubiquitous cheap low-power sensors, and newer use-cases such as Internet of Things (IoT), Deep Learning, and Conversational User Interfaces (CUI).

In order to meet the challenges of these emerging use cases, one must be able to perform deep analysis and learning at high scale, make decisions in near real time, and adjust quickly to new events or learnings. Therefore, the modern data architecture must bring data analytics out of the back office and merge it with operational business systems.

The modern business imperative is to provide hyper-personalized experiences to consumers, based on the real-time context of that user’s interaction. This context is generated from all the available data about that user, about similar users, as well as external data sources that may have an influence on this user’s experience. Thus, all applications will become data-driven applications that are powered by closed-loop analytics. As a consequence, application developers will have to become data scientists and data scientists must have application development skills.

Systems for closed-loop analytics must support deeper model building and analytics on data that was previously trapped in the back office: fast and flexible querying over large data sets, functional transformations and algorithms to support learning, fast handoff of data from one processing sub-system to another, and integration with large-scale analysis services in the cloud or on-premises. The friction of data replication and movement must be minimized.

Data Platform Layers

These fast data analysis systems must interoperate at the object transformation service layer. In this transformation layer, a wide swath of common transformation functions, common query engines, and storage engines must be supported. Since a choice among best-of-breed data processing engines is desired to address different processing requirements, a robust way to support multiple storage layouts and fast data exchange within a distributed data store is required. Therefore, the modern data architecture, at its lowest level, requires the development of a smart storage system that serves the needs of the multiple sub-systems that make up a smart backend for rich interaction and timely decision making.

In the next part of this blog series, we will take a brief tour of data platforms prevalent in the enterprises.

Welcome to Ampool Blog.

We have been hard at work building the next generation of data infrastructure for the past several months. Our design partners and customers have been working closely with us during this period, providing active feedback on beta versions of the Ampool product. We have learnt a lot about their challenges with building data-driven intelligent applications, and that has influenced our roadmap and our focus on production-readiness tremendously.

We are excited to come out of stealth mode with some announcements about the fruits of our labour, and we are preparing to launch the developer edition of the Ampool Active Data Store soon.

We will start with a series of blog posts about the reasons we decided to build Ampool and the challenges of building data infrastructure for next-generation data-driven applications. Please check back often, or subscribe to the feed from the Ampool blog.

Happy Ampooling.