While Apache Hadoop and its associated projects have rapidly evolved toward the promise of a unified analytics platform, Hadoop's core storage component, the Hadoop Distributed File System (HDFS), was never designed with interactive, streaming, or transactional workloads in mind, and it has become a hindrance to achieving that goal.

As a result, multiple architectures, such as the Lambda and Kappa architectures, have been proposed to unify these data processing workloads. Unfortunately, they have proven quite difficult to implement.

In the Lambda architecture, the speed, serving, and batch layers require three different implementations of data processing frameworks for the same functionality, depending on the latency and scale needed. It is the implementer's responsibility to transport the needed data across these three layers, which is a non-trivial task. In addition, data consistency across the three layers cannot be guaranteed without introducing yet another distributed transaction engine between the data applications and the Lambda architecture.

Lambda Architecture
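To make the duplication concrete, here is a minimal sketch in plain Python (all names are hypothetical, and a real system would use a separate framework for each layer): the same per-key count has to be written once for the batch layer and again for the speed layer, and the serving layer has to merge the two views.

from collections import Counter, defaultdict

# Batch layer: recompute per-key counts over the full historical dataset.
# In a real deployment this would be a MapReduce or Spark job over HDFS files.
def batch_counts(historical_events):
    counts = Counter()
    for event in historical_events:
        counts[event["key"]] += 1
    return counts

# Speed layer: maintain the same per-key counts incrementally for recent events.
# The logic duplicates batch_counts, but in a different paradigm (and, in a real
# system, typically in a different framework).
class SpeedLayer:
    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, event):
        self.counts[event["key"]] += 1

# Serving layer: merge the batch view with the real-time delta. Keeping the two
# views consistent is left entirely to the implementer.
def serve(key, batch_view, speed_layer):
    return batch_view.get(key, 0) + speed_layer.counts.get(key, 0)

batch_view = batch_counts([{"key": "clicks"}, {"key": "clicks"}])
speed = SpeedLayer()
speed.on_event({"key": "clicks"})
print(serve("clicks", batch_view, speed))  # prints 3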

The Kappa architecture, introduced by the creators of Apache Kafka, relies on a distributed log as the primary source of data. A distributed log provides decoupling between data producers and data consumers. It allows different batch and streaming analytics engines to consume data from the same data bus, thus solving the data consistency problem introduced in the Lambda architecture. But the difficulty of implementing data processing in the speed, serving, and batch layers in three different engines remains.


Kappa Architecture
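To make the decoupling concrete, here is a toy sketch in plain Python (hypothetical names; a real deployment would use Apache Kafka or a similar distributed log): producers append to a shared log, and each consumer tracks its own offset, so a streaming engine and a batch engine can read the same records independently.

class Log:
    """Toy stand-in for a distributed log (single partition)."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return self.records[offset:]

class Consumer:
    """Each consumer keeps its own offset, so consumers are fully decoupled."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch

log = Log()
streaming_engine = Consumer(log)   # e.g. a streaming analytics job
batch_engine = Consumer(log)       # e.g. a nightly batch job

log.append({"key": "clicks", "value": 1})
print(streaming_engine.poll())     # the streaming job sees the record immediately
log.append({"key": "clicks", "value": 1})
print(batch_engine.poll())         # the batch job later reads the same records from offset 0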

We note that the primary difficulty in implementing the speed, serving, and batch layers in the same unified architecture is due to the deficiencies of the distributed file system in the Hadoop ecosystem. If a storage component could replace or augment HDFS to serve the speed and serving layers, while keeping data consistent with HDFS for batch processing, it could truly provide a unified data processing platform. This observation leads to the butterfly architecture.
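As a rough illustration of that observation, and nothing more, the following plain-Python sketch (hypothetical names, not any particular product) shows a single write path feeding both a fast serving view and an append-only batch store standing in for HDFS, so the two views stay consistent by construction.

class UnifiedStore:
    """Toy sketch: one write path feeding both a serving view and a batch store."""
    def __init__(self):
        self.serving_view = {}   # fast point lookups for the speed and serving layers
        self.batch_store = []    # append-only records, standing in for HDFS files

    def write(self, key, value):
        # A single write path keeps both views consistent without a separate
        # transaction engine shuttling data between layers.
        self.serving_view[key] = value
        self.batch_store.append((key, value))

    def lookup(self, key):
        return self.serving_view.get(key)

    def scan(self):
        # Batch engines (MapReduce, Spark, Hive, ...) would scan this view.
        return list(self.batch_store)

store = UnifiedStore()
store.write("clicks", 3)
print(store.lookup("clicks"), store.scan())  # 3 [('clicks', 3)]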

The main differentiating characteristic of the butterfly architecture is the flexibility in computational paradigms on top of each of the above data abstractions. Thus a multitude of computational engines, such as MPP SQL engines (Apache Impala, Apache Drill, or Apache HAWQ), MapReduce, Apache Spark, Apache Flink, Apache Hive, or Apache Tez, can process various data abstractions, such as datasets, dataframes, and event streams. These computation steps can be strung together to form data pipelines, which are orchestrated by an external scheduler, as sketched below. A resource manager with pluggable, data-aware resource schedulers is a must for implementing the butterfly architecture. Both Apache YARN and Apache Mesos, along with orchestration frameworks such as Kubernetes and hybrid resource management frameworks such as Apache Myriad, have emerged in the last few years to fill this role.
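As a sketch of what stringing computation steps into a pipeline might look like, here is a plain-Python toy (hypothetical step names; in practice each step would run on an engine such as Spark, Flink, or an MPP SQL engine, and the orchestration would be handled by an external scheduler on top of YARN or Mesos).

def ingest(raw_events):
    # Could be a streaming engine reading from an event stream.
    return [e for e in raw_events if "key" in e]

def transform(events):
    # Could be a Spark or Flink job producing a dataframe-like structure.
    return [{"key": e["key"], "count": 1} for e in events]

def aggregate(rows):
    # Could be an MPP SQL engine running a GROUP BY over a dataset.
    totals = {}
    for row in rows:
        totals[row["key"]] = totals.get(row["key"], 0) + row["count"]
    return totals

# The "orchestrator": runs each step in order and hands its output to the next.
def run_pipeline(steps, data):
    for step in steps:
        data = step(data)
    return data

result = run_pipeline([ingest, transform, aggregate],
                      [{"key": "clicks"}, {"key": "clicks"}, {"noise": True}])
print(result)  # {'clicks': 2}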

The butterfly architecture and its associated data flows are illustrated below.


Butterfly Architecture

The main intention of the butterfly architecture is to unify various data processing tasks on a single platform. To implement it, we need to treat data with new, more general abstractions than the current ones of files, directories, tables, and indexes.
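Purely as an illustration, and not as the definitive set of abstractions, the following plain-Python sketch (hypothetical interface names) shows what abstractions that are neither files nor tables might look like: a bounded dataset that can be scanned, and an unbounded event stream that can be subscribed to.

from abc import ABC, abstractmethod

class Dataset(ABC):
    """A bounded collection of records, independent of file or table layout."""
    @abstractmethod
    def scan(self):
        """Yield every record (batch engines would consume this)."""

class EventStream(ABC):
    """An unbounded, append-only sequence of events."""
    @abstractmethod
    def subscribe(self, handler):
        """Invoke handler for each new event (streaming engines would consume this)."""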

In the next blog post in this series, we will take a look at the minimum set of data abstractions that are needed to implement the butterfly architecture.
