In order to hazard a guess about what the future of data architecture holds, let us take a brief tour of how we arrived at the current state. Before relational databases became popular, multiple data models were in use; among these, hierarchical and navigational data systems were used extensively on mainframe-based systems. Since the number of clients for these data systems was limited, they remained monolithic and, more often than not, were offered by the mainframe manufacturer and bundled with the hardware.
The relational model, proposed more than forty years ago, was deemed suitable for a majority of data applications and became very popular for prevalent use cases in banking, insurance, and the financial services industries. Relational database systems became the default backend data system, serving as the store of record for a variety of verticals. The advent of client-server systems, where multiple clients would utilize data stored and served by the same server, gave importance to up-front data modeling, a standard query language (SQL), formal data manipulation semantics (ACID), query concurrency, rule-based and cost-based query optimization, standard access methods (ODBC and JDBC), and a plethora of visual tools for building database-backed applications, data visualizations, and more.
Client access to these operational databases was a mix of CRUD (create, read, update, delete) primitives on either a single record or a small number of records. To provide consistency across multiple CRUD operations, the notion of a transaction was introduced: either all the operations were carried out atomically, or none at all. These data systems came to be known as OLTP (online transaction processing) systems, and their performance was measured in transactions per second.
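The all-or-nothing behavior of a transaction can be seen even in an embedded database. The following is a minimal sketch using Python's built-in sqlite3 module; the `accounts` table and the transfer amounts are hypothetical, chosen purely for illustration.

```python
# Sketch: OLTP-style atomicity with sqlite3 (table and values are hypothetical).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    # "with conn" opens a transaction: it commits if the block succeeds,
    # and rolls back if any statement raises, so both updates apply or neither.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure, neither update would be visible

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 70, 2: 80}
```

If the second UPDATE failed, the first would be rolled back as well, which is exactly the consistency guarantee OLTP systems were built to provide.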
Most business intelligence (BI) and reporting workloads used very different data access patterns: mostly read-only queries over large amounts of historical data. Although operational data systems were initially used to handle both transactional and analytical workloads, they could not meet the needs of low-latency transactions and high-throughput analytics simultaneously. Thus, to serve this new class of applications, data systems specialized in OLAP (online analytical processing) were devised. Since these OLAP systems had to handle large amounts of historical data, they were often built as MPP (massively parallel processing) systems on a shared-nothing distributed architecture. This created two silos of structured data in organizations: one for transactions and another for analytics. Even though both kinds of systems were designed with relational data models in mind, one would often need to integrate multiple transactional data stores across departments to provide complete historical data for analysis. Thus, the notion of periodic ETL (extract-transform-load) was born: jobs that would capture data changes across multiple transactional data stores, map their relationships, and structure them into fact and dimension tables with star or snowflake schemas. The analytical query engines and the storage for analytical data were quite different from their transactional counterparts.
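The transform step of such an ETL job can be sketched in a few lines: raw transactional records are split into a dimension table and a fact table that references it by a surrogate key. All record and field names below (`raw_orders`, `customer_key`, and so on) are hypothetical, invented for illustration.

```python
# Sketch: transforming raw transactional records into a star schema
# (one customer dimension, one orders fact table). Data is hypothetical.
raw_orders = [
    {"order_id": 1, "customer": "Acme", "region": "EU", "amount": 120.0},
    {"order_id": 2, "customer": "Acme", "region": "EU", "amount": 80.0},
    {"order_id": 3, "customer": "Bolt", "region": "US", "amount": 200.0},
]

surrogate_keys = {}  # natural key -> surrogate key
dim_customer = []    # dimension table: one row per distinct customer
fact_orders = []     # fact table: one row per order, referencing the dimension

for rec in raw_orders:
    key = (rec["customer"], rec["region"])
    if key not in surrogate_keys:
        surrogate_keys[key] = len(surrogate_keys) + 1
        dim_customer.append({"customer_key": surrogate_keys[key],
                             "name": rec["customer"],
                             "region": rec["region"]})
    fact_orders.append({"order_id": rec["order_id"],
                        "customer_key": surrogate_keys[key],
                        "amount": rec["amount"]})

print(len(dim_customer), len(fact_orders))  # 2 3
```

A real ETL pipeline would add change capture, incremental loads, and many more dimensions, but the separation of descriptive attributes (dimension) from measures (fact) is the essence of the star schema.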
Analytical data, once stored, would almost never have to change, as it was a historical record of business transactions.
In the world of structured operational and analytical data stores, semi-structured data (such as server logs) and unstructured data (such as natural-language communication in customer interactions) were either discarded or kept in an archival store for compliance reasons. Centralized file systems became a popular choice of data store for semi-structured and unstructured historical datasets, with specialized access layers such as keyword search.
Apache Hadoop aimed to solve the problem of analytics over semi-structured and unstructured data by providing a distributed file system (HDFS) on commodity hardware, coupled with a batch-oriented, flexible data processing paradigm called MapReduce. As the Hadoop ecosystem expanded, it was used to tackle a wider variety of data processing workloads. Thus there arose scripting languages such as Apache Pig, SQL-like query engines such as Apache Hive, and NoSQL stores such as Apache HBase, all using HDFS as their persistence layer.
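The MapReduce paradigm itself is simple enough to sketch in-process. The following is the canonical word-count example, with a plain-Python stand-in for the shuffle/sort step; a real Hadoop job would run the same map and reduce logic in parallel over data in HDFS rather than over a local list.

```python
# Sketch: the MapReduce programming model, simulated in one process.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit an intermediate (key, value) pair per word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle/sort: group intermediate pairs by key, then reduce each group
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["the"])  # 2
```

The appeal of the model is that the map and reduce functions contain no distribution logic at all; the framework handles partitioning, scheduling, and fault tolerance.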
Eventually, compute resource management was separated from the batch-oriented programming model (as Apache Hadoop YARN), which allowed a proliferation of data processing frameworks to run on top of data stored in HDFS. These included traditional MPP data warehouses (such as Apache HAWQ and Apache Impala), streaming analytics systems (such as Apache Apex), and transactional SQL engines (such as Apache Trafodion). This gave rise to the notion of a Data Lake, where all the raw data from across the enterprise, along with external data sources, would be loaded and made available for flexible analytics, using best-of-breed data processing engines on the same granular data.
As the concept of the Data Lake grew popular, several data architectures were proposed to combine various analytical data processing workloads into end-to-end data processing pipelines. In the next part of this blog series, we will give an overview of these modern data architectures.