We are excited to announce the Ampool 2.0 release, an important milestone for Ampool. This release includes many new features and significant performance improvements.

The following major features have been added:

FTable Storage Format

FTable employs a block strategy for its in-memory layer: it groups multiple records together, which minimizes per-record storage overhead and the overall memory footprint. For append-only fact tables, logs, or metrics, this helps you make optimal use of available memory. As of 2.0, the following three formats are supported.

  • AMP_BYTES: Rows are serialized using an Ampool-specific encoding and stored individually. This may consume more memory, but there is no additional overhead during a scan to retrieve a row.
  • AMP_SNAPPY: Rows are serialized using an Ampool-specific encoding and compressed with Snappy once the block reaches the specified block size. This reduces memory usage, but all rows in a block have to be decompressed during a scan.
  • ORC_BYTES: All rows in a block are converted to the ORC columnar format once the block reaches the specified block size. Each block then contains binary ORC data representing the rows, which is interpreted during a scan.
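
The block strategy described above can be illustrated with a minimal sketch. This is not Ampool's implementation or API: the `BlockBuffer` class, its methods, and the use of JSON for serialization are hypothetical, and `zlib` from the Python standard library stands in for Snappy. It only shows the general idea of sealing (and optionally compressing) a block once it reaches the configured block size, and of decompressing whole blocks during a scan.

```python
import json
import zlib


class BlockBuffer:
    """Toy in-memory block store (hypothetical, not Ampool's API)."""

    def __init__(self, block_size, compress=False):
        self.block_size = block_size
        self.compress = compress
        self.open_rows = []      # rows of the block still being filled
        self.sealed_blocks = []  # one bytes object per full block

    def append(self, row):
        self.open_rows.append(row)
        if len(self.open_rows) == self.block_size:
            self._seal()

    def _seal(self):
        # JSON is an assumed stand-in for the Ampool-specific encoding.
        payload = json.dumps(self.open_rows).encode("utf-8")
        if self.compress:
            payload = zlib.compress(payload)  # stand-in for Snappy
        self.sealed_blocks.append(payload)
        self.open_rows = []

    def scan(self):
        # A compressed block must be decompressed as a whole to read any row.
        for payload in self.sealed_blocks:
            if self.compress:
                payload = zlib.decompress(payload)
            yield from json.loads(payload.decode("utf-8"))
        yield from self.open_rows


def demo():
    rows = [{"id": i, "metric": "cpu", "value": i % 7} for i in range(2500)]
    plain = BlockBuffer(block_size=1000)
    packed = BlockBuffer(block_size=1000, compress=True)
    for row in rows:
        plain.append(row)
        packed.append(row)
    plain_bytes = sum(len(b) for b in plain.sealed_blocks)
    packed_bytes = sum(len(b) for b in packed.sealed_blocks)
    return rows, plain, packed, plain_bytes, packed_bytes
```

Calling `demo()` appends 2,500 rows with a block size of 1,000, which leaves two sealed blocks and 500 rows in the open block. The compressed variant holds fewer bytes in its sealed blocks, at the cost of decompressing each block on every scan.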

FTable Delta update:

FTable stores multiple records together in a block and needs to propagate updates to the replicated copies and to the persistence layer. The following two types of delta propagation are added in this release.

  • Delta propagation to replicated copies: Previously, any update to a block caused the entire block to be replicated, which incurred major network overhead. With this release, for both single append and batch append operations that update a block, only the appended records are propagated to the replicated copies. This reduces network overhead and improves the performance of append operations on the FTable.
  • Delta propagation to persistence layer: Because FTable stores the rows of each bucket inside blocks, all ingestion operations (append/batch append) update the same block until the block size is reached. Previously, each such update wrote the complete block value to disk rather than just the change to the block, which caused a lot of disk writes and compaction. With this release, only the updated records of the block are propagated to the persistence layer, which reduces disk writes and compaction.
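
To see why shipping only the delta matters, here is a back-of-the-envelope sketch. It is a simplified model, not a measurement: the function `rows_shipped` is hypothetical, it counts row copies sent to a single replica for single-row appends, and it assumes that full-block replication resends every row currently in the block on each append.

```python
def rows_shipped(num_rows, block_size, delta):
    """Count row copies sent to ONE replica for single-row appends.

    delta=True  -> only the appended row is sent.
    delta=False -> the whole current block is resent on every append
                   (a simplifying assumption of this model).
    """
    shipped = 0
    rows_in_block = 0
    for _ in range(num_rows):
        rows_in_block += 1
        shipped += 1 if delta else rows_in_block
        if rows_in_block == block_size:
            rows_in_block = 0  # block is full; the next append starts a new one
    return shipped


full = rows_shipped(20_000, 1_000, delta=False)
incremental = rows_shipped(20_000, 1_000, delta=True)
```

Under this model, filling one block of size B with single appends ships B(B+1)/2 row copies without delta propagation and B row copies with it. The model counts data volume only, so it should not be read as a prediction of elapsed time.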

Security Enhancements:

Ampool supports authorization to control access to data by authenticated users. The admin can control access to a table's data depending on the identity of the user attempting the operation. The following two types of authorization are supported:

  • Sentry Authorization: Apache Sentry is a centralized store of authorization data that enforces fine-grained, role-based authorization on data stored in data storage systems such as Ampool ADS.
  • LDAP Authorization: Ampool ADS leverages an LDAP server for user authentication and authorization. In this use case, the LDAP server creates and manages users, and no information about users is stored in Ampool. Group/role information is managed both on the LDAP server and in Ampool.
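
Conceptually, role-based authorization of this kind reduces to a lookup from user to groups to granted privileges. The sketch below is purely illustrative and does not use the Sentry, LDAP, or Ampool APIs: the dictionaries, the user, group, and table names, and the `is_authorized` function are all hypothetical, with the group membership table standing in for what an LDAP server would provide.

```python
# Hypothetical data: in a real deployment, group membership would come from
# the LDAP server and the grants from the authorization store.
USER_GROUPS = {
    "alice": {"analysts"},
    "bob": {"ops"},
}
ROLE_GRANTS = {
    "analysts": {("sales_facts", "READ")},
    "ops": {("sales_facts", "READ"), ("sales_facts", "WRITE")},
}


def is_authorized(user, table, action):
    """Return True if any of the user's groups grants `action` on `table`."""
    groups = USER_GROUPS.get(user, set())
    return any((table, action) in ROLE_GRANTS.get(group, set()) for group in groups)
```

With this data, `alice` may read `sales_facts` but not write to it, and an unknown user is denied everything.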

Column Statistics:

For FTable, per-block column statistics are generated and stored with each block. The stored statistics are the minimum and maximum value of each column, and they are updated with each append or batch append. Statistics are stored for these data types: INT, LONG, BYTE, SHORT, FLOAT, DOUBLE, DATE, TIMESTAMP, STRING. When a scan uses filters, the statistics make it possible to skip a block entirely, without scanning or decompressing it, whenever no row in the block could match.
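
The following minimal sketch shows how min/max statistics allow a range filter to skip blocks. It is an illustration under assumptions, not Ampool's code: the `StatsBlock` class, `scan_range` function, and the `ts` column are hypothetical, and all rows are assumed to share the same columns.

```python
class StatsBlock:
    """Toy block that keeps (min, max) per column (hypothetical)."""

    def __init__(self):
        self.rows = []
        self.stats = {}  # column name -> (min, max)

    def append(self, row):
        self.rows.append(row)
        for col, val in row.items():
            lo, hi = self.stats.get(col, (val, val))
            self.stats[col] = (min(lo, val), max(hi, val))

    def may_contain(self, col, lo, hi):
        """Return False only when no row can satisfy lo <= row[col] <= hi."""
        if col not in self.stats:
            return False  # empty block or unknown column: nothing to match
        block_min, block_max = self.stats[col]
        return not (block_max < lo or block_min > hi)


def scan_range(blocks, col, lo, hi):
    """Return matching rows and the number of blocks actually scanned."""
    matches, scanned = [], 0
    for block in blocks:
        if not block.may_contain(col, lo, hi):
            continue  # skipped using statistics alone
        scanned += 1
        matches.extend(r for r in block.rows if lo <= r[col] <= hi)
    return matches, scanned


def build_blocks(num_blocks=3, block_size=1000):
    blocks = []
    for b in range(num_blocks):
        block = StatsBlock()
        for i in range(b * block_size, (b + 1) * block_size):
            block.append({"ts": i, "value": i % 10})
        blocks.append(block)
    return blocks
```

For three blocks covering `ts` values 0 to 2999, a filter on `ts` between 1500 and 1510 touches only the middle block; the other two are ruled out by their statistics.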

In addition, the following known issues from previous releases are fixed in this release:

  • Added the ability to delete all versions of all keys matching a given filter list, without having to provide the key list.
  • MASH: Added a command to show how table data and buckets are distributed across primary and secondary copies.
  • Added support for lowercase type names in the table schema.
  • Improved server scan performance.

Performance Improvements with delta replication

Configuration:

Single Append operation on FTable
column-length=100
num-columns=10
redundancy=3 
Number of buckets : 113
FTable Block Size : 1000

Number of Rows    Append Time in Seconds       Append Time in Seconds          Speedup
                  (with Delta Replication)     (without Delta Replication)
20000             16                           320                             ~20 Times
200000            164                          3202                            ~20 Times
2000000           1688                         32030                           ~19 Times
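
The speedup column is simply the ratio of the two append times. As a small worked check (the variable and function names are illustrative; the numbers are taken from the table above):

```python
# (rows, seconds with delta replication, seconds without delta replication)
APPEND_TIMINGS = [
    (20_000, 16, 320),
    (200_000, 164, 3202),
    (2_000_000, 1688, 32030),
]


def speedups(timings):
    """Return (rows, speedup) pairs, with speedup = time without / time with."""
    return [(rows, round(without / with_delta, 1))
            for rows, with_delta, without in timings]


result = speedups(APPEND_TIMINGS)
```

The ratios come out between roughly 19x and 20x for all three row counts, which matches the rounded figures in the table.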

Performance Improvements with delta persistence

Configuration:

Server Nodes: 8
Heap size per server: 50GB
Number of buckets: 113
FTable Block Size: 1000
Client Batch Size: 1000
Redundancy: 3

Each of the three tables below lists the same three rows, in the same order. Ingestion used five parallel clients.

With Delta Persistence

Number of Rows      Ingestion     Total Size      Total Heap     Total Writes on      Number of oplog Files Created
(Size)              Time (sec)    on Disk (GB)    Size (GB)      Disk, iostat (GB)    (total files created)
50Million (40GB)    949           72              69.6           59.67                72
50Million (40GB)    945.8         72              70.4           59.66                72
50Million (40GB)    941.2         72              69.6           59.66                72

Without Delta Persistence

Number of Rows      Ingestion     Total Size      Total Heap     Total Writes on      Number of oplog Files Created
(Size)              Time (sec)    on Disk (GB)    Size (GB)      Disk, iostat (GB)    (total files created)
50Million (40GB)    1168          104             70.4           90.76                1211
50Million (40GB)    1127          104             70.4           92.93                1225
50Million (40GB)    1146.2        104             70.4           91.66                1202

Difference

Number of Rows      % Reduction    % Reduction in     % Reduction in    % Reduction in Number
(Size)              in Time        Size on Disk       Disk Writes       of oplog Files
50Million (40GB)    18.75          30.76              34.25             94.05
50Million (40GB)    16.07          30.76              35.79             94.12
50Million (40GB)    17.88          30.76              34.91             94.00
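
Each reduction figure is computed as (without - with) / without × 100. The short sketch below recomputes the first row from the measurements above; the function name is illustrative. The published percentages appear to be truncated rather than rounded, so recomputed values can differ by 0.01.

```python
def pct_reduction(without_delta, with_delta):
    """Percentage reduction relative to the run without delta persistence."""
    return round((without_delta - with_delta) / without_delta * 100, 2)


# First row: (without delta persistence, with delta persistence)
time_reduction = pct_reduction(1168, 949)      # ingestion time (sec)
disk_reduction = pct_reduction(104, 72)        # total size on disk (GB)
write_reduction = pct_reduction(90.76, 59.67)  # total writes on disk (GB)
oplog_reduction = pct_reduction(1211, 72)      # number of oplog files
```

The largest effect is on the number of oplog files, which drops by about 94%, while ingestion time improves by roughly 16% to 19% across the three rows.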

 

Release notes are updated at http://docs.ampool-inc.com/core/RN2.0.0/

Click here to download the Ampool release 2.0.

Our open source project, Monarch, will soon be updated with the Ampool 2.0 release changes. This release will also be available on AWS Marketplace (in both single-node and cluster modes) in early 2018.

Public cloud, once used primarily by agile startups, has taken enterprise IT infrastructure by storm. Instant fulfillment, a wealth of services, a pay-as-you-go model (in addition to reserved resources), and multiple deployment options (bare metal, virtual machines, containers, and functions) have attracted application developers in startups and large companies alike to the public clouds.

At Ampool, we have been using Amazon’s public cloud, AWS, for development, testing, benchmarking, proofs of concept, and sharing artifacts from day one, thanks to the generous AWS Activate grant. We designed Ampool Active Data Store (ADS) as a cloud-agnostic in-memory computing platform and have tested it on other public clouds, such as Google Cloud Platform (GCP) and IBM’s SoftLayer. Even so, customer inquiries about Ampool’s availability on AWS have outnumbered those for every other public cloud.

As a result, we are announcing today that Ampool ADS is now available on AWS Marketplace.

We are keeping our commitment that the single-node version for development and testing will be free forever, and have listed a Free Single Instance AMI (EC2 charges may apply) at https://aws.amazon.com/marketplace/pp/B077D81DD1. The single-node version should not be used for production, since it lacks several capabilities for high availability and fault tolerance.

For production deployments, we have listed the Ampool Cluster Version at https://aws.amazon.com/marketplace/pp/B0784YHDW8, with a 31-day free trial. The Ampool Cluster edition is based on a CloudFormation template, which allows single-click deployment of Ampool ADS on EC2 instances.

Currently, version 1.5 of Ampool ADS is listed on AWS. We will upgrade it to version 2.0 once we finish stress testing it.

Documentation of Ampool ADS for both on-premises and AWS deployments is available at http://docs.ampool-inc.com/.

Try it out, and send us feedback.

From relative obscurity two decades ago, Open Source Software has come a long way and has become a dominating force in enterprises. Most modern data platforms, both operational and analytical, are built with OSS projects such as Hadoop, Cassandra, MongoDB, Spark, and Kafka. In our experience, many traditional enterprises in financial services, telecom, manufacturing, and many other verticals have taken an “Open Source First” approach. In addition, enterprise workloads moving from on-premises to public or private clouds are evenly divided between proprietary services provided by cloud vendors and open source software, either hosted by commercial vendors or self-deployed and managed. In Mary Meeker’s 2017 “Internet Trends” report, lock-in to cloud-proprietary services is cited as a concern by 22% of enterprises, and that share is growing rapidly. Therefore, hosted or self-managed services powered by open source software have become the choice of enterprises on public clouds.

Ampool’s Open Source Lineage

Ampool’s Active Data Store (ADS) is powered by Apache Geode (previously Pivotal’s proprietary GemFire In-Memory Data Grid). As Chief Scientist at Pivotal, I was deeply involved in defining Pivotal’s OSS strategy for its data products, and facilitated the open sourcing of Pivotal GemFire as an Apache project, along with Ampool’s Technical Advisory Board member, Roman Shaposhnik.

Currently, Ampool’s Hitesh Khamesra and Avinash Dongre are on the Project Management Committee (PMC) for Apache Geode, and Ampool has employed five committers for Apache Geode. In addition, Suhas Gogate, our Chief Solutions Architect, and I have been long-term contributors to Apache Hadoop ecosystem projects.

Ampool uses Apache Geode as a foundation, and all our additions are built on top of Geode. Thus, the entire functionality of the Geode In-Memory Data Grid, which is an in-memory caching layer and an object store for applications, is included and enabled in Ampool. As of now, Ampool ADS is strictly a superset of Apache Geode, with minimal changes to Apache Geode itself to enable the various additions that make it a performant and robust memory-centric analytical store.

In addition, many data access and persistence connectors to and from Ampool are built with OSS query engines and frameworks, such as Apache Spark, Apache Hive, Apache Kafka, Apache Apex, Apache Trafodion, Cask Data Analytics Platform, and Apache Hadoop Distributed File System, with OSS file formats such as Apache ORC and Apache Parquet.

Ampool believes in the superiority of OSS as a distribution model, which reduces friction in adoption for developers of data platforms. However, we are also focused on building a viable business, which will allow us to innovate rapidly, meet the cutting-edge requirements of our customers, and provide them with ease of use, secure deployment, and painless management. One of the primary reasons Ampool ADS was developed for the last two years as a proprietary, closed-source addition to Apache Geode is that the consensus-driven approach of an open source community significantly slows development, and would not have sustained the rapid pace needed by a startup catering to the cutting-edge needs of its customers and prospects.

Having demonstrated that Ampool can meet those needs for our customers, I am pleased to announce today that Ampool Active Data Store, along with data access and ingestion connectors to several OSS query engines and data ingestion frameworks, is available as project “Monarch” on Ampool’s open GitHub repository at https://github.com/ampool/monarch under the Apache License (ASLv2).

Project “Monarch” currently contains code for both the Active Data Store and connectors for Hive, Kafka, Presto, and Spark. More connectors, such as those for Apache Calcite and HDFS (ORC & Parquet), will be released into OSS soon.

Currently, the OSS Monarch is based on Ampool v1.5. We are working hard on releasing Ampool 2.0 soon, and it will be merged upstream into Monarch immediately after the 2.0 release.

If you only want to try out Ampool as a binary distribution for a single node, you can download it, as before, from http://www.ampool.io/product.

Documentation for usage and deployment can be found at http://docs.ampool-inc.com/

If you need support using Ampool (powered by Monarch), email us at support [at] ampool [dot] io.

In addition, we have a discussion group at https://groups.google.com/forum/#!forum/ampool-users.

Looking forward to feedback & contributions.