Evaluating HTAP Databases for Machine Learning Applications

Businesses are building a growing number of intelligent applications that traditional databases are unable to support. A new class of databases, Hybrid Transactional and Analytical Processing (HTAP) databases, offers a variety of capabilities, each with specific strengths and weaknesses to consider. This article aims to give application developers and data scientists a better understanding of the HTAP database ecosystem so they can make the right choice for their intelligent application.



Oracle Exadata is the most widely deployed high-end OLTP system, and it has been extended with OLAP capabilities through an in-memory columnar option. Exadata is fully ACID compliant and supports all isolation levels. Oracle uses a hybrid representation in which data is inserted as tuples (row-based) and then converted to in-memory columnar representations. Exadata is strictly a scale-up engineered solution that delivers excellent performance, but at a very high cost. This system is not open source, but Oracle does offer it in a database-as-a-service (DBaaS) model.
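To make the hybrid representation concrete, here is a minimal Python sketch of a table that accepts row-based inserts and later pivots them into per-column lists. The class and method names (HybridTable, convert_to_columnar, column_sum) are hypothetical and chosen only for illustration; this is not how Oracle or any other vendor implements its storage engine.

```python
class HybridTable:
    """Toy hybrid store: rows land in a row buffer, then get pivoted into columns."""

    def __init__(self, column_names):
        self.column_names = list(column_names)
        # Write-optimized side: one tuple per inserted row.
        self.row_buffer = []
        # Read-optimized side: one list of values per column.
        self.columns = {name: [] for name in self.column_names}

    def insert(self, row):
        """Transactional-style insert: append a whole tuple."""
        if len(row) != len(self.column_names):
            raise ValueError("row width does not match schema")
        self.row_buffer.append(tuple(row))

    def convert_to_columnar(self):
        """Move buffered rows into the per-column lists and clear the buffer."""
        for row in self.row_buffer:
            for name, value in zip(self.column_names, row):
                self.columns[name].append(value)
        self.row_buffer.clear()

    def column_sum(self, name):
        """Analytical scan: reads one column plus any rows not yet converted."""
        idx = self.column_names.index(name)
        converted = sum(self.columns[name])
        pending = sum(row[idx] for row in self.row_buffer)
        return converted + pending


table = HybridTable(["id", "amount"])
table.insert((1, 10))
table.insert((2, 20))
table.convert_to_columnar()
table.insert((3, 5))            # still row-based until the next conversion
total = table.column_sum("amount")
```

The point of the design is that inserts touch a single tuple, while an aggregate over one column only has to read that column's values once the rows have been converted.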

MemSQL is primarily used for analytical workloads. It is ACID compliant and handles transactional updates, but these are not optimized for large-scale, real-time OLTP workloads, so MemSQL is not typically used to power real-time, concurrent applications. MemSQL uses a hybrid representation in which data is inserted as tuples (row-based) and then converted to columnar. MemSQL is a scale-out solution typically deployed on commodity clusters, where it scales by distributing computation across the cluster. This system is not open source and is not available through a DBaaS offering.

Splice Machine is a fully ACID-compliant database with multi-version concurrency control (MVCC), capable of powering applications, including those that use native Oracle PL/SQL. OLAP computation takes place on Apache Spark, while transactional queries run on Apache HBase. Native data is persisted in Apache HBase, and data can also be persisted in external Parquet and ORC columnar files. Splice Machine is a scale-out solution whose cost-based optimizer routes work to either Apache Spark or Apache HBase. This system is open source and will be available as a DBaaS within the first quarter of 2017.
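The idea of an optimizer that chooses between an analytical engine and a transactional engine can be sketched in a few lines of Python. This is only an illustration of cost-based routing under simple assumptions: the QueryPlan type, the choose_engine function, and the row threshold are all hypothetical and do not reflect Splice Machine's actual optimizer or configuration.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    description: str
    estimated_rows_scanned: int


# Hypothetical cutoff chosen for illustration only; not a real product setting.
OLAP_ROW_THRESHOLD = 10_000


def choose_engine(plan, threshold=OLAP_ROW_THRESHOLD):
    """Send small lookups to the transactional path and large scans to the analytical path."""
    if plan.estimated_rows_scanned < threshold:
        return "hbase"   # short, index-driven work
    return "spark"       # long-running scans, joins, and aggregations


lookup = QueryPlan("fetch one order by primary key", estimated_rows_scanned=1)
report = QueryPlan("aggregate sales by region", estimated_rows_scanned=5_000_000)
engines = [choose_engine(lookup), choose_engine(report)]
```

A real optimizer estimates cost from table statistics and considers many candidate plans; the sketch keeps only the final routing decision.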

Apache Hive is primarily used for analytical workloads. It is ACID compliant and handles transactional updates, but these are not optimized for large-scale, real-time OLTP workloads, so Hive is not typically used to power real-time, concurrent applications. One Hive application can add rows while another reads from the same partition without the two interfering with each other. This system uses native file-based storage on the Hadoop Distributed File System (HDFS). Older systems stored raw data on HDFS, but newer systems use the Apache Parquet or ORC columnar formats, which compress data and perform very well. Hive is a scale-out solution typically deployed on commodity clusters. It relies on HDFS and MapReduce computations. This system is open source and is available as a DBaaS through Qubole, Amazon and Google.
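One common way to let a reader and a writer share a partition without interfering is to tag every batch of rows with a commit id and have each reader see only the batches committed before it started. The Python sketch below shows that idea in miniature. The VersionedPartition class and its methods are hypothetical, and the sketch is a simplification for illustration, not a description of Hive's storage layout.

```python
class VersionedPartition:
    """Toy partition in which each appended batch carries a commit id."""

    def __init__(self):
        self._rows = []          # list of (commit_id, row) pairs
        self._last_commit = 0

    def append(self, rows):
        """Commit a batch of rows under a new commit id and return that id."""
        self._last_commit += 1
        commit_id = self._last_commit
        self._rows.extend((commit_id, row) for row in rows)
        return commit_id

    def snapshot(self):
        """Capture the newest commit id visible to a reader that starts now."""
        return self._last_commit

    def read(self, snapshot_id):
        """Return only rows committed at or before the reader's snapshot."""
        return [row for commit_id, row in self._rows if commit_id <= snapshot_id]


partition = VersionedPartition()
partition.append(["a", "b"])
reader_snapshot = partition.snapshot()   # reader begins here
partition.append(["c"])                  # writer adds rows afterwards
visible_to_reader = partition.read(reader_snapshot)
```

The reader that started before the second append continues to see a consistent view, while a reader that starts later sees all three rows.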

Apache HAWQ is primarily used for analytical workloads. It is ACID compliant and handles transactional updates, but these are not optimized for large-scale, real-time OLTP workloads, so HAWQ is not typically used to power real-time, concurrent applications. HAWQ stores data in multiple formats on HDFS, including Apache Parquet. HAWQ is a scale-out solution typically deployed on commodity clusters, where it scales by distributing computation across the cluster. This system is open source and is available as a service through Pivotal.

Apache Trafodion is a SQL-on-HBase solution intended to support full OLTP workloads with a two-phase commit protocol. Trafodion uses HBase as a persistent row-based store. It is a scale-out solution that distributes work across a cluster of execution agents, which in turn distribute work across HBase region servers. This system is open source but is not available as a DBaaS.
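For readers unfamiliar with two-phase commit, the Python sketch below shows the generic protocol: a coordinator first asks every participant to prepare, and commits only if all of them vote yes. The Participant class and two_phase_commit function are hypothetical names used for illustration. The sketch omits logging, timeouts, and failure recovery, and it is not Trafodion's implementation.

```python
class Participant:
    """Toy resource manager that votes in phase 1 and applies the outcome in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        """Phase 1: vote yes (True) or no (False)."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed everywhere, False if it was rolled back."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]

    # Phase 2: commit only on a unanimous yes; otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


healthy = [Participant("region-1"), Participant("region-2")]
outcome = two_phase_commit(healthy)
```

If any single participant votes no, every participant ends in the aborted state, which is what keeps a distributed update atomic.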

Apache Kudu/Impala is primarily used for analytical workloads. It is ACID compliant and handles transactional updates, but these are not optimized for large-scale, real-time OLTP workloads, so Kudu is not typically used to power real-time, concurrent applications. Apache Kudu is a hybrid key-value store that has both a tuple-based and a columnar store. Kudu uses a hybrid representation in which data is inserted as tuples into a write-optimized LSM-tree and then converted to columnar using Apache Parquet. Kudu and Impala are scale-out systems that leverage parallel computation across the cluster and vectorize instructions as much as possible. This system is open source but is not available as a DBaaS.

HTAP systems combine transactional and analytical capabilities. Most HTAP systems can accept transactional updates and run analytics on them, while some can also power concurrent applications while performing analytics. The systems differ in other ways as well: some are expensive scale-up solutions while others are less expensive scale-out solutions, and some are open source while others are proprietary. Organizations can weigh these trade-offs to scale traditional applications and make them intelligent.

Bio: Monte Zweben is a technology industry veteran. Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit. In 1998, Monte founded and became CEO of Blue Martini Software, the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. Monte currently serves as Chairman of Rocket Fuel Inc. as well as on the Dean’s Advisory Board for Carnegie Mellon’s School of Computer Science.
