The previous post discussed mutable data in the Ampool Active Data Store (ADS). Here we will discuss how Ampool ADS enables you to deal with very large volumes of immutable data. FTable stands for “flow table”, and it enables fast ingestion of very large amounts of immutable data (a.k.a. facts data). The data is internally stored in multiple storage tiers and moved across them seamlessly whenever required. The storage tiers are typically arranged in ascending order of latency and descending order of cost (per GB). Newer data is typically accessed much more frequently than older data. Thus, high-demand data is stored in the tier with the lowest latency, whereas data in lower demand is stored in tiers with higher latency but lower cost.
The Ampool FTable nicely complements MTable in data warehousing use cases: mutable dimension data is stored in MTable, while large, ever-growing facts data is stored in FTable with tiered storage. By separating these two types of tables and keeping FTable primarily immutable, we avoid the need for background data compaction and thus use SSD/flash storage more effectively by eliminating the write-amplification phenomenon.
Similar to MTable, FTable also provides partitioning, redundancy, persistence, and support for a table schema with basic and complex data types. Data is partitioned across the cluster nodes using hash-based partitioning, with the specified column values as the partition key. The table can also be configured with redundancy for availability and fault tolerance. It also provides an option to persist the in-memory data to local disk storage so that it can be recovered in case of node failures or restarts.
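To make the partitioning scheme concrete, here is a minimal Python sketch of hash-based partitioning on a set of partition-key column values. The function name and the way the key is encoded are illustrative assumptions, not Ampool's actual implementation or API:

```python
import hashlib

def partition_for(key_columns, num_partitions):
    """Map a record's partition-key column values to a partition id by
    hashing a string encoding of those values (illustrative scheme)."""
    key = "|".join(str(v) for v in key_columns)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Records with identical partition-key values always land on the same
# partition, which is what lets a cluster route them deterministically.
p1 = partition_for(["device-42", "2019-06-01"], 8)
p2 = partition_for(["device-42", "2019-06-01"], 8)
```

The key property is determinism: every node computes the same partition id for the same key values, so no central lookup is needed to locate a record's partition.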
Being an immutable table, at a high level it supports only the append and scan operations. (Note: to curate occasional data inconsistencies, it does provide administrative operations such as bulk delete/update with arbitrary filter criteria.) Append enables you to ingest data at very high rates; either single-record or bulk insertion is possible. A typical key-value store incurs a constant overhead per entry. FTable instead employs a block strategy for its in-memory layer that groups multiple records together, minimizing the per-record overhead and the overall memory footprint. Thus, for typical append-only tables it helps you make optimal use of the available memory.
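The block strategy can be sketched as follows: instead of one store entry per record, records are packed into fixed-capacity blocks, so per-entry bookkeeping is paid once per block rather than once per record. This is a conceptual Python sketch with made-up class names and a toy capacity, not Ampool's internal data structure:

```python
class Block:
    """A fixed-capacity block that holds many records together,
    amortizing per-entry overhead across all records in the block."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []

    def is_full(self):
        return len(self.records) >= self.capacity


class BlockedStore:
    """Append-only in-memory layer: records are appended into the
    current block; a new block is opened only when it fills up."""
    def __init__(self, block_capacity=1000):
        self.block_capacity = block_capacity
        self.blocks = [Block(block_capacity)]

    def append(self, record):
        if self.blocks[-1].is_full():
            self.blocks.append(Block(self.block_capacity))
        self.blocks[-1].records.append(record)

    def scan(self):
        for block in self.blocks:
            yield from block.records


store = BlockedStore(block_capacity=2)
for i in range(5):
    store.append({"id": i})
# 5 records are packed into 3 blocks rather than 5 separate store entries.
```

With a realistic block capacity (thousands of records), the fixed per-entry cost is divided across the whole block, which is why an append-only table built this way has a much smaller memory footprint than a one-entry-per-record key-value store.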
FTable supports INSERT_TIME as a built-in column so users can efficiently retrieve data over a specified time range without scanning the entire table. When data is ingested, Ampool records the insertion time along with each record. This insertion time is used internally as an implicit ordering of the data within a partition. This helps speed up real-time range queries based on insertion time, e.g. get all data for the past N hours.
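Why the implicit ordering matters can be shown with a small sketch: because an append-only partition is naturally sorted by insertion time, a time-range query can binary-search the range boundaries instead of examining every row. The class and method names below are illustrative, not part of the Ampool API:

```python
import bisect
import time

class TimeOrderedPartition:
    """Records in a partition, kept in insertion-time order (which an
    append-only store gets for free), enabling fast time-range scans."""
    def __init__(self):
        self.insert_times = []  # monotonically increasing by construction
        self.records = []

    def append(self, record, insert_time=None):
        ts = insert_time if insert_time is not None else time.time()
        self.insert_times.append(ts)
        self.records.append(record)

    def range_scan(self, start_ts, end_ts):
        """Return records with start_ts <= insert_time <= end_ts using
        binary search on the sorted timestamps (O(log n) to find bounds)."""
        lo = bisect.bisect_left(self.insert_times, start_ts)
        hi = bisect.bisect_right(self.insert_times, end_ts)
        return self.records[lo:hi]


part = TimeOrderedPartition()
for ts in [1, 2, 3, 4, 5]:
    part.append({"t": ts}, insert_time=ts)
recent = part.range_scan(2, 4)  # only rows in [2, 4], no full scan
```

A query like “all data for the past N hours” then becomes `range_scan(now - N*3600, now)`, touching only the matching slice of the partition.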
With FTable you can configure a hierarchy of tiers. In a recommended configuration, the first tier is always a memory tier, the second is a shared-nothing local disk/SSD tier, and the third is an archive tier using shared stores such as HDFS or S3. FTable can also be configured to move data from one tier to the next based on specific policies. Both time-based and space-usage-based policies can be configured per tier, per table: as data exceeds the time or space threshold, it is automatically moved to the next tier. This allows you to make more optimal use of the available resources in the respective tiers. Ampool FTable uses open data storage formats, e.g. ORC and Parquet, to store data on persistent tiers such as local disk, HDFS, or S3.
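The two eviction policies can be sketched together in a few lines. This is a simplified model under assumed names and thresholds, not Ampool's tier-movement code; it only shows how time and space limits combine to select the records that move to the next tier:

```python
class Tier:
    """One storage tier with optional time- and space-based thresholds."""
    def __init__(self, name, max_age_seconds=None, max_records=None):
        self.name = name
        self.max_age_seconds = max_age_seconds
        self.max_records = max_records
        self.entries = []  # (insert_time, record), oldest first


def evict(tier, now):
    """Return the records exceeding the tier's time or space threshold;
    the caller appends them to the next tier in the hierarchy."""
    moved = []
    # Time-based policy: evict entries older than max_age_seconds.
    if tier.max_age_seconds is not None:
        while tier.entries and now - tier.entries[0][0] > tier.max_age_seconds:
            moved.append(tier.entries.pop(0))
    # Space-based policy: evict the oldest entries beyond max_records.
    if tier.max_records is not None:
        while len(tier.entries) > tier.max_records:
            moved.append(tier.entries.pop(0))
    return moved


memory = Tier("memory", max_age_seconds=10, max_records=2)
memory.entries = [(0, "a"), (5, "b"), (8, "c")]
demoted = evict(memory, now=12)  # "a" is older than 10s; 2 records remain
```

Because the oldest data is always evicted first, chaining such tiers (memory, then local disk, then HDFS/S3) naturally keeps the hottest data in the fastest tier.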
Scan is the other key operation on FTable. It allows you to retrieve selected records by applying a set of arbitrary filters based on specific columns. A scan operates seamlessly across all the configured tiers, as applicable, retrieving data from every tier transparently. The scan operation exploits data locality by applying the filters where the data resides, so only the matching results are sent over the network.
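The tier-spanning, filter-pushdown behavior reduces to a simple shape, sketched below with illustrative names (real tiers would read memory blocks, local ORC/Parquet files, or HDFS/S3 objects rather than Python lists): the predicate is evaluated at each tier, and only matching records flow back to the caller.

```python
def scan(tiers, predicate):
    """Scan transparently across all configured tiers, applying the
    filter where the data resides so only matches are returned."""
    for tier in tiers:          # e.g. memory, local disk, HDFS/S3
        for record in tier:
            if predicate(record):
                yield record


memory_tier = [{"metric": "cpu", "value": 91}, {"metric": "mem", "value": 40}]
disk_tier = [{"metric": "cpu", "value": 12}]

# The caller sees one logical table; tier boundaries are invisible.
cpu_rows = list(scan([memory_tier, disk_tier], lambda r: r["metric"] == "cpu"))
```

The non-matching `mem` record is filtered out at its tier and never crosses the network, which is the locality benefit the text describes.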
In modern applications that support both real-time access and historical analysis, recent data is accessed frequently in a real-time manner, while older data is utilized in a batch-oriented way. FTable’s tiered storage policy serves both types of workloads very effectively, eliminating the need for the multiple data stores used in the lambda architectural pattern.
Similar to MTable, FTable also supports the MASH shell, Java APIs, and connectors for multiple compute frameworks such as Apache Spark and Apache Hive for programmatic access to the tiered data store. In subsequent posts we will dive deeper into the details of these access mechanisms.