The main intention of the butterfly architecture, described in the previous post, is to unify various data processing tasks on a single platform. To implement the butterfly architecture, we need to treat data under management with new, more general abstractions that differ from today's abstractions, such as files, directories, tables, and indexes.
In the butterfly architecture, we organize data as linked collections of three types of abstractions:
Datasets: This is the most flexible abstraction: a partitioned collection of arbitrary records. Beyond the partitioning key, very little structure is imposed on records; interpreting their contents is left to the processing framework, with the aid of a system catalog. This is equivalent to schema-on-read data, which is the only kind of data managed by current Hadoop/NoSQL data systems. The system catalog stores information about each dataset, as well as relationships among datasets. Each dataset is given a unique identifier, and the catalog is a logical set of RDF triples of the form (Relation, Object1, Object2). For example, to indicate that the dataset with ID D4596 is named SearchLog, the catalog has an entry (NameOf, “SearchLog”, D4596); to indicate that dataset D4596 is located on HDFS, it has an entry (Location, D4596, “hdfs://namenode:port/user/data/something”). Note that this is a logical representation of the system metadata about datasets; for reasons of efficiency, it may be represented physically as a set of fixed-width tables. Datasets can be stored on multiple storage systems, and even the partitions of a single dataset may be spread across multiple storage back-ends. In addition, when a dataset is stored as a stream of bytes in files, or transferred across a network, the serialization and deserialization format is user-defined or operator-defined.
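To make the triple-based catalog concrete, here is a minimal sketch in Python. The `Catalog` class and its methods are hypothetical illustrations; only the (Relation, Object1, Object2) shape and the NameOf/Location entries come from the description above.

```python
# A sketch of the logical catalog as a set of (relation, object1, object2)
# triples, with a simple triple-pattern query. Class and method names are
# illustrative, not part of any real system.

class Catalog:
    def __init__(self):
        self.triples = []

    def add(self, relation, obj1, obj2):
        self.triples.append((relation, obj1, obj2))

    def query(self, relation=None, obj1=None, obj2=None):
        # Return all triples matching the fields that are not None.
        return [t for t in self.triples
                if (relation is None or t[0] == relation)
                and (obj1 is None or t[1] == obj1)
                and (obj2 is None or t[2] == obj2)]

catalog = Catalog()
catalog.add("NameOf", "SearchLog", "D4596")
catalog.add("Location", "D4596", "hdfs://namenode:port/user/data/something")

# Look up where dataset D4596 lives.
location = catalog.query(relation="Location", obj1="D4596")[0][2]
```

A physical implementation would likely store these triples in fixed-width tables per relation, as noted above, but the query pattern stays the same.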
Data frames: Dataframes are structured datasets, partitioned by a user-specified partitioning key contained in the individual records. Dataframes can be mutable or immutable: an immutable dataframe may not be modified in any way once it is created, while the individual records of a mutable dataframe can be inserted, updated, or deleted. Dataframes are typically created by computation frameworks that pipeline processing stages, with datasets as the initial inputs of these pipelines. Dataframes are very similar to structured tables in relational database management systems, with a predefined schema. However, for modern workloads, dataframes must support richer data types, such as lists, maps, and structs, with the ability to traverse these complex nested types. Immutable dataframes are suitable for analytical workloads, whereas mutable dataframes serve transactional CRUD workloads.
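The nested types mentioned above can be sketched with a single record. All field names here are invented for illustration; the point is only that a record may mix scalars, lists, maps, and structs, and that the system must be able to traverse paths into them.

```python
# A sketch of one dataframe record with nested types. "user_id" plays the
# role of the partitioning key; every other field name is hypothetical.

record = {
    "user_id": 42,                       # partitioning key (scalar)
    "tags": ["search", "mobile"],        # list type
    "counters": {"clicks": 3},           # map type
    "device": {                          # struct type, itself nested
        "os": "Android",
        "screen": {"w": 1080, "h": 1920},
    },
}

def get_path(rec, path):
    """Traverse a dotted path such as 'device.screen.w' into a nested record."""
    for part in path.split("."):
        rec = rec[part]
    return rec

width = get_path(record, "device.screen.w")  # -> 1080
```

Real systems express this traversal in their query language (e.g. dotted column references), but the underlying requirement is the same: schema-aware navigation of nested structures.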
Event Streams: Event streams are unbounded dataframes in which at least one field of the records (events) is mostly monotonically increasing; usually, this field is a timestamp or a sequence number. Optionally, a stream may have a window size, specified either as a number of records (when the monotonically increasing field is a sequence number) or as a time duration (when it is a timestamp). Within a window, events may arrive somewhat out of order; across windows, however, the sequence number or timestamp is strictly monotonically increasing.
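The windowing contract above can be sketched as follows: events carry a sequence number, arrivals may be out of order within a window, and the consumer repairs the order window by window. This is a toy count-based grouping, not any particular stream processor's API; all names are illustrative.

```python
# A sketch of count-based windowing over an event stream. Events are
# (sequence_number, payload) pairs; within a window they may arrive out
# of order, so each window is sorted before it is emitted.

def window_by_count(events, window_size):
    """Group events into fixed-size windows by sequence number."""
    windows = {}
    for seq, payload in events:
        windows.setdefault(seq // window_size, []).append((seq, payload))
    # Emit windows in increasing order; sort events inside each window.
    for key in sorted(windows):
        yield sorted(windows[key])

# Events 1 and 2 arrive out of order, but within the same window.
events = [(0, "a"), (2, "c"), (1, "b"), (4, "e"), (3, "d"), (5, "f")]
result = list(window_by_count(events, window_size=3))
# result[0] == [(0, "a"), (1, "b"), (2, "c")]
```

A time-duration window works the same way, with timestamps bucketed by interval instead of sequence numbers divided by a count.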
To store and access these data abstractions efficiently, the core of the butterfly architecture is the storage subsystem, which provides a flexible metadata store, efficient concurrent querying, and flexible transformations among these data abstractions. In the next post in this blog series, we will discuss the technology trends that are making this core storage subsystem possible.