From a nascent Apache project in 2006 to a commercially supported data platform backed by two public companies, Apache Hadoop has come a long way. Initially adopted by Web 2.0 companies such as Yahoo, Facebook, LinkedIn & Twitter, Hadoop-based data platforms started becoming a major part of enterprise data infrastructure around 2011. Today, a majority of Fortune 500 companies have adopted Hadoop data lakes. The maturity of Hadoop-native SQL query engines (such as Hive & Impala) and the availability of Hadoop-based BI and machine learning platforms (such as Apache Spark) have made Hadoop data lakes more accessible to traditional enterprise data practitioners. In addition, the availability of hosted Hadoop services on all major public cloud providers has allowed anyone with a credit card to start working with Hadoop, reducing the skills gap.

Having had the fortune of developing Hadoop from day one while I was at Yahoo, I have witnessed various stages of Hadoop adoption in enterprises. Most enterprises adopt a Hadoop-based platform for its cheap scale-out storage (HDFS), as an “Active Archive” or a staging data store. The second phase of Hadoop adoption is to migrate workloads that can be easily parallelized away from expensive data warehouses, the most common example being expensive Extract-Transform-Load (ETL) workloads. The third phase is to perform advanced analytics at scale. Based on conversations with various customers and prospects over the last three years, we believe that most enterprises are somewhere between phases 2 & 3.

Previous State

The Hadoop data lake (either on-premises or in the cloud) is where raw data from other parts of the enterprise first lands. These raw datasets may consist of web-server & application-server logs, transaction logs from various front-end databases, periodic snapshots of dimensional datasets, and events & metrics from hardware or software sensors. IT operations (Data Ops) is responsible for ensuring that the raw data is loaded into these data lakes in a timely manner, and that data-cleansing workloads create “source of truth” datasets. This is the first stage of the ETL pipeline (sometimes called ELT, as raw data is loaded first, and then transformed).

Many transformations on these source-of-truth datasets involve complex joins across multiple datasets, imposing structure on semi-structured or unstructured data (flattening nested data), and segmentation (across geographies, demographics, products, or time intervals). After these transformations, multiple analyzable datasets are created. Next, pre-written business intelligence and reporting queries are run on these newly created datasets, and the results are uploaded to different systems that serve them to business analysts.
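
As a simplified illustration of this kind of transformation, the sketch below uses Spark SQL to join a raw click-event table with a customer dimension table, impose structure on it, and write out a partitioned, analyzable dataset. The table and column names are hypothetical, and the details will obviously differ per data lake.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class BuildAnalyzableDataset {
    public static void main(String[] args) {
        // Hive-enabled Spark session, so the source-of-truth tables are visible
        SparkSession spark = SparkSession.builder()
                .appName("clicks-by-region-etl")
                .enableHiveSupport()
                .getOrCreate();

        // Join raw click events with a customer dimension table (hypothetical names),
        // imposing structure and preparing for segmentation by region and day.
        Dataset<Row> clicksByRegion = spark.sql(
                "SELECT c.customer_id, d.region, c.page_id, " +
                "       to_date(c.event_time) AS event_date " +
                "FROM   raw.click_events c " +
                "JOIN   dims.customers d ON c.customer_id = d.customer_id");

        // Write an analyzable, partitioned dataset back into the data lake
        clicksByRegion.write()
                .mode(SaveMode.Overwrite)
                .partitionBy("region", "event_date")
                .format("parquet")
                .saveAsTable("analytics.clicks_by_region");
    }
}
```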

For organizations that have begun to utilize advanced analytics (deep analytics, machine learning, etc.), these workloads, consisting of long-running chains of Hadoop jobs, are launched as “cron jobs”. Since most data lakes are operated & managed by IT operations, the analytics workloads are carefully vetted to make sure that they execute correctly, that the data lake has enough capacity to execute them in a timely manner, and that they do not bring down the entire cluster or other mission-critical workloads through misbehavior.

Challenges & Deficiencies

In addition to the “production data workloads” that run regularly (say, every hour or every day), there are many other workloads in any organization that are ad hoc in nature. (The Merriam-Webster dictionary defines ad hoc as “for the particular end or case at hand without consideration of wider application”.) Many data exploration, visualization, data science, and machine-learned model-building workloads are ad hoc, as these workloads have no set frequency, nor are they carefully tuned for performance.

As Hadoop data lakes become the single source of truth for data, various teams in the organization need access to the Hadoop data lake in order to perform these ad hoc workloads. These workloads, being experimental in nature, are typically time-bound. A team may run experiments on a subset of data from the data lake for a few weeks or months, and then either throw away the work (if the experiment is unsuccessful) or productionize it (if it is successful and needed long term). The main challenge for Hadoop data lakes is the coexistence of mission-critical production data pipelines and ad hoc workloads on the same data lake.

In the early Hadoop deployments at Yahoo, we had to maintain several separate Hadoop clusters, some reserved for production workloads and some for ad hoc analytical use cases. However, keeping the data synchronized between the two remained a challenge (DistCp, a tool used to copy data between different Hadoop clusters, was the biggest consumer of resources among all users). In addition, because of the burstiness of the ad hoc workloads, either query latencies were unpredictable or cluster utilization was very low. This situation continues even today at several enterprises.

Prior to the advent of Hadoop data lakes, when a data warehouse was the single source of truth for enterprises, a similar challenge was solved by creating multiple “data marts”, typically one per division within an organization, to allow ad hoc analytics on a subset of data away from the data warehouse. In this usage pattern, Ampool performs the same role as a data mart, with the Hadoop data lake as the single source of truth.

With Ampool

The block diagram above shows the deployment & data flow when Ampool is used as a data mart for Hadoop data lakes.

  1. User specifies tables, partitions, ranges, and fields from data lake catalog
  2. Ampool executes queries on data lake to bulk load requested data
  3. (Or) Ampool bulk loads requested Hadoop files directly, enforcing schema on read
  4. Ampool presents loaded data as partitioned tables of structured/semi-structured data
  5. Analyst uses Hadoop-native query engines on Ampool to interactively query data (see the sketch after this list)
  6. (Or) Uses Ampool connectors for Python/R to access data
  7. Most tools & frameworks that integrate with Hadoop query engines can be used
  8. (Optionally) operational data is integrated in real-time (e.g. slow-changing dimension tables)
  9. Results published & shared through data lake
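
To make steps 2, 4, and 5 above concrete, here is a minimal sketch that uses Spark to bulk load a subset of a Hive table into Ampool and then query it interactively. The connector format name ("io.ampool"), the option keys, the locator host/port, and the table and column names are all placeholders; the actual connector settings should be taken from the Ampool documentation.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PopulateDataMart {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("populate-ampool-data-mart")
                .enableHiveSupport()
                .getOrCreate();

        // Step 2: select only the partitions and fields the analysts asked for
        Dataset<Row> recentOrders = spark.sql(
                "SELECT order_id, customer_id, region, amount, order_date " +
                "FROM   warehouse.orders " +
                "WHERE  order_date >= '2017-10-01'");

        // Step 4: expose the subset as a partitioned table in the in-memory data mart
        recentOrders.write()
                .format("io.ampool")                        // placeholder connector name
                .option("ampool.locator.host", "locator1")  // placeholder option keys
                .option("ampool.locator.port", "10334")
                .saveAsTable("mart_recent_orders");

        // Step 5: analysts query the in-memory table through the same Spark
        // (or Hive/Presto) interface they already use
        spark.table("mart_recent_orders")
                .groupBy("region")
                .sum("amount")
                .show();
    }
}
```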

Why Ampool?

Several features in Ampool enable this usage pattern.

  • Native integration with Hadoop-native query engines to perform parallel loads from data lakes (Hive, Spark etc.)
  • Integration with Hadoop authentication & authorization (Kerberos, LDAP, Apache Sentry)
  • High speed data-local connectivity for Hadoop-native query engines (Hive, Spark, Presto, EsgynDB)
  • Both Row-oriented & Column-oriented memory & disk layouts to efficiently execute queries
  • Support for Hadoop-native file formats (ORC & Parquet)
  • Polyglot APIs (Java, REST, Python, R)
  • Native integration with Apache Kafka & Apache Apex to rapidly ingest operational data from other data sources
  • Linear scalability & automatic load balancing at high performance
  • Smart bidirectional tiering to local disks, extending memory beyond working set
  • Hadoop-native deployment, management & monitoring (e.g. Cloudera Manager, Apache Ambari)

In addition, Ampool can be deployed on demand using Docker containers, and can be orchestrated with popular engines such as Kubernetes. (Deployment with Mesosphere DCOS is on the roadmap.) Since the data stored in Ampool is dynamically updatable, changes to the underlying data in the Hadoop data lake can be reflected in place in the Ampool data mart.

In the near future, we are working on making it even simpler to populate the Ampool data mart. Our goal is to make it as simple as working with a source-code repository (e.g. git checkout, git pull, and git push).

When to consider Ampool as a high-speed data mart for your Hadoop data lake?

You should consider Ampool as an augmentation to your Hadoop data lake, if:

  • Your users are demanding access to the production Hadoop data lake for performing ad hoc analytics.
  • Your users are already familiar with one of the Hadoop-native query engines or compute frameworks, and do not want to learn new query languages just for performing ad hoc analytics.
  • These ad hoc analytics workloads, if successful, may need to be deployed some day on production data lakes, using the same Hadoop-native compute frameworks.
  • Your business analysts are impatient, and won’t wait months for IT to carve out separate Hadoop clusters and make data available to them.
  • Your users demand predictable latencies for their interactive workloads, and you do not want to spend excessive budget by over-provisioning your data lake.
  • You want your ad hoc analytics users to follow the same authentication & authorization & data governance policies as your production data lake.

If you are interested in exploring Ampool for this use case, write to us to schedule a demo.

Almost all applications are powered by one or more persistent data stores. Relational OLTP DBMSs have most often been used as data serving backends. However, in recent years, for large scale-out applications, non-relational (NoSQL) data stores, which sacrifice consistency for availability, are being used for data serving. The front-end user experience of most modern mobile or web-based applications is powered by multiple API calls to backend services, which are in turn backed by multiple data serving platforms.

Previous State

Typical application architectures are composed of multiple tiers (sometimes with load balancers & proxies between them): applications make API calls to application servers, and the application servers translate these API calls into multiple requests to the appropriate data serving systems in the backend, collect the responses, combine them using business logic, and send the result back to the application, which renders it on the front end. When there is temporal locality (data that was fetched recently will be fetched again soon), the data serving backends are augmented with a lookaside in-memory caching layer, such as Redis or Memcached, in order to reduce the load on the persistent data serving systems.
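
For reference, the lookaside (cache-aside) pattern described above typically looks like the sketch below. The Cache and UserStore interfaces are simplified stand-ins for a real cache client (such as Redis or Memcached) and a persistent data store; they are not part of any particular library.

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;

/** Simplified stand-ins for a real cache client and a persistent backend store. */
interface Cache {
    Optional<String> get(String key);
    void put(String key, String value, long ttl, TimeUnit unit);
}

interface UserStore {
    String loadProfile(String userId);   // potentially slow query against the DBMS
}

public class UserProfileService {
    private final Cache cache;
    private final UserStore store;

    public UserProfileService(Cache cache, UserStore store) {
        this.cache = cache;
        this.store = store;
    }

    /** Lookaside caching: check the cache first, fall back to the data store on a miss. */
    public String getProfile(String userId) {
        return cache.get(userId).orElseGet(() -> {
            String profile = store.loadProfile(userId);       // cache miss: hit the backend
            cache.put(userId, profile, 10, TimeUnit.MINUTES);  // populate for future requests
            return profile;
        });
    }
}
```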

Challenges & Deficiencies

As the number of applications and the number of concurrent users of each application increase, the data serving systems are overwhelmed by the number of concurrent requests, and they tend to spend a lot of time (query latency) waiting for individual requests to complete. Since a single application request may result in multiple requests to the underlying data stores, the application response is blocked until the last response from the data stores arrives. These data fetch requests can be arbitrarily complex, and may result in a sequential scan of all records in the data store, exhibiting long-tail behavior.

We identify the core problem in such systems as the inability to efficiently construct logical views over multiple data stores, serve these views concurrently to multiple applications, allow these views to be updated, and propagate those updates back to the underlying data stores.

Adding an in-memory caching layer to each underlying data store takes advantage of temporal locality and speeds up data access from each individual data store; however, the application server still has to execute business logic and combine responses (or issue additional requests) before data is returned to the application. In addition, since a single-node lookaside cache is not fault-tolerant, applications see severely degraded performance when the cache goes down. A number of distributed & fault-tolerant caching systems, such as Apache Geode, Apache Ignite, and Hazelcast, aim to solve the cache availability issue. These distributed caching systems are essentially distributed in-memory hash tables with a primary key and an arbitrary value, and they provide fast key-based lookups on this hash table. However, they do not solve the fundamental problem of serving updateable, materialized views that may require fast, short, scan-oriented queries.

With Ampool

The block diagram above illustrates the data flow when Ampool is used as application acceleration middleware.

  1. Ampool pre-loads initial datasets
  2. In-memory materialized views are constructed by embedding business logic as stored procedures
  3. Application requests data from View access layer
  4. View access layer parses request, adds metadata, and forwards to Ampool
  5. Ampool checks authorization at view-level, and serves projection/subset of view
  6. View access layer performs additional transformations
  7. Response is sent back to Application
  8. In case of updates, Ampool asynchronously updates the underlying databases
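
Ampool's Change Data Capture mechanism (listed below) handles step 8 internally, but as a rough illustration of the asynchronous write-behind idea, a generic sketch might look like the following. The JDBC URL, table, and columns are hypothetical, and error handling is reduced to the bare minimum.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** A view update captured by the acceleration layer, to be written back asynchronously. */
record ViewUpdate(String orderId, String status) {}

public class WriteBehindUpdater implements Runnable {
    private final BlockingQueue<ViewUpdate> queue = new LinkedBlockingQueue<>();
    private final String jdbcUrl;   // e.g. "jdbc:postgresql://db-host/orders" (placeholder)

    public WriteBehindUpdater(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }

    /** Called whenever the in-memory view is updated (step 8 in the flow above). */
    public void enqueue(ViewUpdate update) { queue.add(update); }

    /** A background thread drains the queue and applies updates to the underlying database. */
    @Override
    public void run() {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(
                     "UPDATE orders SET status = ? WHERE order_id = ?")) {
            while (!Thread.currentThread().isInterrupted()) {
                ViewUpdate u = queue.take();    // blocks until an update is available
                stmt.setString(1, u.status());
                stmt.setString(2, u.orderId());
                stmt.executeUpdate();
            }
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }
}
```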

Why Ampool?

Ampool has many features that enable it to be used as a multi-tenant application acceleration platform.

  • Transparent and customizable bulk-loader functionality that can use JDBC, NoSQL (e.g. MongoDB), or file system APIs
  • Low-Latency (~20µs) lookups & in-place updates allow fast construction of materialized views, and synchronous updates
  • Pre-processors (similar to database triggers) allow customization of fine-grained authorizations
  • Row versioning enables fast lock-free inserts & updates
  • High throughput parallel query to rapidly serve materialized views with projection & filter pushdowns
  • Change Data Capture (CDC) mechanism to asynchronously update underlying tables
  • High query concurrency (~100s of concurrent clients) at low (sub-millisecond) latency
  • JDBC Connectivity with Apache Calcite integration
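
The JDBC connectivity mentioned above can then be exercised with plain JDBC code, along the lines of the sketch below. The connection URL and the view/column names are placeholders; the actual driver class and URL format come from the Ampool/Calcite JDBC documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ViewScanExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URL; consult the Ampool JDBC documentation
        // for the actual driver class and URL format.
        String url = "jdbc:ampool://locator-host:10334";

        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement stmt = conn.prepareStatement(
                     // A short scan over a materialized view, with filter & projection
                     "SELECT order_id, status, amount FROM order_view " +
                     "WHERE customer_id = ? AND order_date >= ?")) {
            stmt.setString(1, "C-1001");
            stmt.setString(2, "2017-12-01");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s %.2f%n",
                            rs.getString("order_id"),
                            rs.getString("status"),
                            rs.getDouble("amount"));
                }
            }
        }
    }
}
```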

Please note that Ampool does not enforce distributed transactional updates across multiple non-collocated tables by default. However, by integrating Ampool clients with external transaction managers (such as Apache Omid or Apache Tephra), one can implement transactional behavior if needed.

When to consider Ampool for Application Acceleration?

You should consider Ampool as middleware for application acceleration if:

  • Your applications need to serve data concurrently from multiple RDBMS or NoSQL data stores (including file systems)
  • Your application servers are getting overwhelmed by business logic that results in sequential scans on the underlying data stores, with the long-tail query latencies such scans exhibit
  • Your business logic is written in Java (a current limitation of Ampool that will go away in the future)
  • Your data model consists of both structured & semi-structured data
  • The working set of your multi-tenant applications does not fit within a single machine’s memory

If you are interested in exploring Ampool, write to us to schedule a demo.

Wishing our blog readers a very happy new year 2018.

In the first blog series of this new year, we will outline three broad patterns in which the Ampool Active Data Store (ADS) and In-Memory Platform are being used for real use cases by our customers and pilots. In this post, we describe the first usage pattern for Ampool: Near-App Analytics.

Previous State

Modern applications need to provide hyper-personalized experiences to their end users, so as to improve user engagement by providing the most relevant information. To achieve this, the application needs to accurately model user behavior while the user interacts with the application, and tailor the presented information to that activity, as opposed to using a generic one-size-fits-all model. In traditional data lake architectures, user activity is logged by the applications and stored in log files on application servers. Periodically, this collection of user activity logs is transported and ingested into a centralized data lake as a set of raw data files. Later, in batches, these raw data files are cleansed & denormalized by combining them with other reference datasets, and user activity sessions are created. Activity modelling algorithms using machine learning techniques are applied over a sliding window of time (for example, the last 30 days) on these datasets, and each model is evaluated against previously known and labelled user behavior to determine the efficacy of the newly created model. If the model is found to be effective & beneficial, it is uploaded to the application servers and applied the next time the user visits the application.

Deficiencies

Because user activity data is staged at “the edge”, i.e. the application servers, and goes through multiple transports, format conversions, batch ingestion, etc. before it finally lands in the centralized data lake and becomes available for analytics, the business value of the data, and the actionability of the insights that could be gained from analyzing it, rapidly diminish with time. In addition, because of the delay between user activity and the insights generated, the machine-learned user behavior models are often stale. Paul Maritz (Executive Chairman of Pivotal) succinctly described the core application of “Big Data” as the ability of “Companies & Organizations to catch people or things in the act of doing something and affect the outcome.” Clearly, feeding stale insights back to applications means losing the ability to catch users in the act of interacting with the application and to affect the outcome. The core problem that needs to be solved is creating “Real-Time, Personalized, Actionable Information, in Current Context”. This is where the Ampool In-Memory Platform comes into the picture.

“Companies need to learn how to catch people or things in the act of doing something and affect the outcome.”

-Paul Maritz, Executive Chairman, Pivotal

With Ampool

The block diagram above illustrates where the Ampool platform is used for Near-App Analytics, along with the data flow.

  1. Application emits data exhaust (user activity events) on a message queue (e.g. Apache Kafka); see the producer sketch after this list
  2. Ampool connector for message queue fetches, sessionizes, & denormalizes events in Ampool
  3. Real-time Analytics (model refinement, dashboards, anomaly detection) performed in Ampool
  4. Results of analytics (models, visualization, alerts) emitted on message queue
  5. Serving store is updated with results of analytics
  6. Application uses results of analytics
  7. Colder data persisted in data lake for historical analytics (e.g. large scale batch model training)
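
As an illustration of step 1, a minimal (hypothetical) producer that publishes user activity events to a Kafka topic might look like this; the broker address, topic name, and event format are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityEventEmitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");   // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The application publishes its "data exhaust" as the activity happens
            String event = "{\"userId\":\"u-42\",\"page\":\"/checkout\"," +
                    "\"ts\":" + System.currentTimeMillis() + "}";
            // Keying by user id keeps each user's events ordered within a partition,
            // which makes downstream sessionization simpler
            producer.send(new ProducerRecord<>("user-activity", "u-42", event));
        }
    }
}
```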

Why Ampool?

The following features of Ampool enable Near-App Analytics seamlessly:

  • Native integration with message queues (e.g. Apache Kafka). An event stream published to a topic in the message queue immediately materializes as either a flowing table, or a materialized view in Ampool
  • Low-Latency (~20µs) lookups & in-place updates eliminate batch ETL, especially when reference (or dimensions) tables are already populated in Ampool
  • High throughput (~6.5 Billion 1Kilobyte events/day/node) ingestion & linear scalability of a distributed platform for most demanding applications
  • Programmable co-processors (triggers) for data cleansing, transformation, & denormalization, executed before and after operations on data
  • Sub-second analytics, with projection pushdown, filter pushdown, & data-local computations
  • Efficient integration with analytical frameworks (e.g. Apache Spark, Apache Hive, Presto), eliminates need to learn new analytical computation frameworks
  • Change Data Capture (CDC) stream listener to push analytics results on message queue (or serving store)
  • Seamless & configurable tiering to colder storage in Hadoop-native file formats (Apache Parquet & Apache ORC) for historical analytics
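
To put the per-node ingestion figure above in perspective, a quick back-of-the-envelope calculation: 6.5 billion 1-kilobyte events per day per node works out to roughly 6.5 × 10⁹ ÷ 86,400 ≈ 75,000 events per second, or about 75 MB/s of sustained ingest, per node.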

In addition to the above-mentioned features, Ampool can be deployed using familiar application deployment frameworks, such as Docker containers, and can be orchestrated with modern platforms such as Kubernetes, reducing the need to learn new or unproven platforms.

If you are using Pivotal GemFire (or Apache Geode) for the caching needs of your applications, you will find Ampool ADS (powered by Apache Geode) a natural fit for your near-app analytics needs.

When to consider Ampool for Near-App Analytics?

You should consider deploying Ampool for your near-app analytics needs, if:

  • Your scale-out applications generate huge volumes of structured or semi-structured data rapidly (several billion events per day)
  • The delay between user activity and resulting insights decreases the business value of the data generated
  • Your applications’ serving data stores (OLTP RDBMS, or NoSQL key-value stores) are not able to perform rapid analytics
  • You cannot “close the loop” for analytics for various reasons, including multiple slow staging environments, batch ETL, batch analytics, etc.


If you are interested in exploring Ampool, write to us to schedule a demo.