Wishing our blog readers a very happy new year 2018.

In the first blog series of this new year, we will outline three broad patterns in which our customers and pilots use the Ampool Active Data Store (ADS) and In-Memory Platform for real use cases. In this post, we describe the first usage pattern for Ampool: Near-App Analytics.

Previous State

Modern applications need to provide hyper-personalized experiences to their end users, improving user engagement by presenting the most relevant information. To achieve this, the application needs to accurately model user behavior during interaction and tailor the presented information to the user's activity, as opposed to using a generic one-size-fits-all model.

Previously, in data lake architectures, applications logged user activity to log files on the application servers. Periodically, this collection of user activity logs was transported and ingested into a centralized data lake as a set of raw data files. Later, in batches, these raw data files were cleansed and denormalized by combining them with other reference datasets, and user activity sessions were created. Activity modelling algorithms using machine learning techniques were applied over a sliding window of time (for example, the last 30 days) on these datasets, and each new model was evaluated against previously known, labelled user behavior to determine its efficacy. If the model was found to be effective and beneficial, it was uploaded to the application servers and applied the next time the user visited the application.

Deficiencies

Because the user activity data is staged on “the edge” (i.e., the application servers) and goes through multiple transports, format conversions, batch ingestion, etc. before it finally lands in the centralized data lake and becomes available for analytics, the business value of the data and the actionability of the insights that could be gained from analyzing it diminish rapidly with time. In addition, because of the delay between user activity and the insights generated, the machine-learned user behavior models are often stale. Paul Maritz (Executive Chairman of Pivotal) succinctly described the core application of “Big Data” as the ability of “Companies & Organizations to catch people or things in the act of doing something and affect the outcome.” Clearly, feeding stale insights back to applications means losing the ability to catch users in the act of interacting with the application and to affect the outcome. The core problem to be solved is to create “Real-Time, Personalized, Actionable Information, in Current Context”. This is where the Ampool In-Memory Platform comes into the picture.

“Companies need to learn how to catch people or things in the act of doing something and affect the outcome.”

-Paul Maritz, Executive Chairman, Pivotal

With Ampool

The block diagram above illustrates where the Ampool platform is used for Near-App Analytics, along with the data flow.

  1. Application emits data exhaust (user activity events) on a message queue (e.g. Apache Kafka)
  2. Ampool connector for message queue fetches, sessionizes, & denormalizes events in Ampool
  3. Real-time Analytics (model refinement, dashboards, anomaly detection) performed in Ampool
  4. Results of analytics (models, visualization, alerts) emitted on message queue
  5. Serving store is updated with results of analytics
  6. Application uses results of analytics
  7. Colder data persisted in data lake for historical analytics (e.g. large scale batch model training)
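
To make step 2 more concrete, here is a minimal sketch in plain Python of what sessionizing and denormalizing an event stream can look like. It does not use the Ampool connector or any Ampool API; the `Event` fields, the 30-minute inactivity gap, and the `user_profiles` reference table are assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed inactivity timeout that closes a session; not a value from this post.
SESSION_GAP_SECONDS = 30 * 60


@dataclass
class Event:
    """Hypothetical user activity event as it might arrive from a message queue."""
    user_id: str
    timestamp: float  # seconds since epoch
    action: str


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group each user's events into sessions split by inactivity gaps."""
    by_user = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.timestamp):
        by_user[ev.user_id].append(ev)

    sessions = []
    for user_events in by_user.values():
        current = [user_events[0]]
        for ev in user_events[1:]:
            if ev.timestamp - current[-1].timestamp > gap:
                # Gap exceeded: close the current session and start a new one.
                sessions.append(current)
                current = [ev]
            else:
                current.append(ev)
        sessions.append(current)
    return sessions


def denormalize(session, user_profiles):
    """Join one session with reference (dimension) data into a flat record."""
    first = session[0]
    profile = user_profiles.get(first.user_id, {})
    record = {
        "user_id": first.user_id,
        "start": first.timestamp,
        "end": session[-1].timestamp,
        "actions": [e.action for e in session],
    }
    # Prefix reference columns so they cannot clash with session columns.
    record.update({f"profile_{k}": v for k, v in profile.items()})
    return record


events = [
    Event("u1", 0, "view"),
    Event("u1", 60, "click"),
    Event("u1", 4000, "view"),  # more than 30 minutes later: new session
    Event("u2", 10, "view"),
]
user_profiles = {"u1": {"segment": "gold"}}  # hypothetical reference table

sessions = sessionize(events)
rows = [denormalize(s, user_profiles) for s in sessions]
for row in rows:
    print(row)
```

In a real deployment the reference lookup would be served from tables already populated in the store rather than from an in-process dictionary, which is what removes the separate batch ETL step.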

Why Ampool

The following features of Ampool enable Near-App Analytics seamlessly:

  • Native integration with message queues (e.g. Apache Kafka). An event stream published to a topic in the message queue immediately materializes as either a flowing table, or a materialized view in Ampool
  • Low-Latency (~20µs) lookups & in-place updates eliminate batch ETL, especially when reference (or dimension) tables are already populated in Ampool
  • High throughput (~6.5 billion 1 KB events/day/node) ingestion & linear scalability of a distributed platform for the most demanding applications
  • Programmable co-processors (triggers) for data cleansing, transformations, & denormalization are executed pre- and post-operations on data
  • Sub-second analytics, with projection pushdown, filter pushdown, & data-local computations
  • Efficient integration with analytical frameworks (e.g. Apache Spark, Apache Hive, Presto), eliminating the need to learn new analytical computation frameworks
  • Change Data Capture (CDC) stream listener to push analytics results on message queue (or serving store)
  • Seamless & configurable tiering to colder storage in Hadoop-native file formats (Apache Parquet & Apache ORC) for historical analytics
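
To put the ingestion figure in the list above into more familiar units, the short calculation below converts ~6.5 billion 1 KB events per day per node into per-second rates. It assumes 1 KB means 1,000 bytes and that the figure is a sustained daily average; the post does not state either, so treat the results as rough orders of magnitude.

```python
EVENTS_PER_DAY = 6.5e9          # per node, from the feature list
EVENT_BYTES = 1000              # assumption: 1 KB = 1,000 bytes
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
megabytes_per_second = events_per_second * EVENT_BYTES / 1e6
terabytes_per_day = EVENTS_PER_DAY * EVENT_BYTES / 1e12

print(f"{events_per_second:,.0f} events/s per node")
print(f"{megabytes_per_second:.1f} MB/s per node")
print(f"{terabytes_per_day:.1f} TB/day per node")
```

Under these assumptions, a single node averages roughly 75,000 events per second, or about 75 MB/s and 6.5 TB of raw event data per day.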

In addition to the above-mentioned features, Ampool can be deployed in familiar application deployment frameworks, such as Docker containers, and can be orchestrated using modern platforms such as Kubernetes, thereby reducing the need to learn new or unproven platforms.

If you are using Pivotal GemFire (or Apache Geode) for the caching needs of your applications, you will find Ampool ADS (Powered by Apache Geode) a natural fit for your near-app analytics needs.

When to consider Ampool for Near-App Analytics?

You should consider deploying Ampool for your near-app analytics needs, if:

  • Your scale-out applications generate huge volumes of structured or semi-structured data rapidly (several billion events per day)
  • The delay between user activity and resulting insights decreases the business value of the data generated
  • Your applications’ serving data stores (OLTP RDBMS, or NoSQL K-V Store) are not able to perform rapid analytics
  • You cannot “close the loop” for analytics for various reasons, including multiple slow staging environments, batch ETL, batch analytics, etc.

If you are interested in exploring Ampool, write to us to schedule a demo.
