Interview: James Taylor, Salesforce on Apache Phoenix – RDBMS for Big Data

We discuss the beginnings of the Phoenix project, the decision to make it open source, the relational database layer on HBase, and the key reasons for the superior performance of Apache Phoenix.

Twitter Handle: @hey_anmol

James Taylor
is an architect at Salesforce in the Big Data Group. He founded the Apache Phoenix project and leads its on-going development efforts. Prior to Salesforce, James worked at BEA Systems on projects such as a federated query processing system and a SQL-based complex event programming platform, and has worked in the computer industry for the past 20+ years at various start-ups. He lives with his wife and two daughters in San Francisco.

Here is my interview with him:

Anmol Rajpurohit: Q1. How and when did you (and Mujtaba) get the motivation to start Phoenix? Why the name "Phoenix"?

James Taylor: Mujtaba and I started Phoenix about four years ago as an internal project at Salesforce to make it easy for people to leverage HBase, which was being rolled out as a complementary data store at Salesforce. It became clear pretty early on that we weren't going to get the adoption we wanted without a SQL front-end, as Salesforce is a big relational shop - no one wants to learn the new, proprietary, low-level APIs of HBase.

In addition, we wanted front-end web applications serving up HBase data, so your standard map-reduce through Hive was not going to meet the latency requirements. Hence, Phoenix was born, with the name based on a failed prior attempt to build our own big data stack from the ground up. We certainly learned a lot, and Phoenix rose from its ashes.

AR: Q2. What inspired you to make Phoenix open source? How have the project priorities evolved since then?

JT: We quickly realized the general utility of Phoenix as an alternate, easier mechanism to access HBase data. Salesforce has always used a lot of open source, and this was a good opportunity to further advance our commitment to open source. It's also allowed Phoenix to grow more quickly as partners such as Intel and Hortonworks came on board to help round out the functionality.

Our priorities have always been driven by the Phoenix user community - that hasn't changed. As both the user and developer community has grown, however, we've had to put more processes in place to ensure stability: branching strategies, backward compatibility constraints, automated building and testing, and testing at scale.

AR: Q3. How do you compare HBase data model with the RDBMS model? What are the merits of having a relational database layer on top of HBase?

JT: HBase has a sparse data model, storing a separate cell per value that is set, while RDBMSs typically have a dense model, storing all values for a single row together. Both have pros and cons. In addition, HBase stores its data in immutable files which conceptually overlay each other to achieve mutability, while relational systems typically update data in place. This model helps HBase scale horizontally more easily while making support for transactions more difficult.
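The sparse-versus-dense contrast can be illustrated with a small sketch. This is plain Python, not HBase or Phoenix code; the dict-based storage and `get_cell` helper are invented here purely to mirror the idea of a cell existing only when a value was actually set.

```python
# Dense (RDBMS-like): every column has a slot in the row, even when NULL.
dense_row = {"id": 1, "name": "Alice", "email": None, "phone": None}

# Sparse (HBase-like): one (row key, column qualifier) -> cell entry,
# materialized only for values that were actually written.
sparse_cells = {
    ("row1", "cf:name"): "Alice",
    # "email" and "phone" were never set, so no cells exist for them.
}

def get_cell(cells, row_key, qualifier):
    """A missing cell was simply never stored; a miss returns None."""
    return cells.get((row_key, qualifier))

print(get_cell(sparse_cells, "row1", "cf:name"))   # Alice
print(get_cell(sparse_cells, "row1", "cf:email"))  # None
```

The point of the sketch: in the sparse model, unset columns cost no storage at all, which is what lets HBase tables carry millions of potential columns cheaply.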

AR: Q4. What are the major architectural components of Apache Phoenix? 

JT: Phoenix has a pretty typical query engine architecture with a parser, compiler, planner, optimizer, and execution engine. We push as much computation as possible into the HBase server which provides plenty of hooks for us to leverage.
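The "push computation to the server" idea can be sketched in a few lines. This is a hedged illustration in plain Python, not actual coprocessor code: the `regions` data and function names are invented, but the shape matches the technique, where each region server aggregates its own rows locally and only small partial results cross the network to the client.

```python
# Each "region" holds a shard of the table's (key, value) rows.
regions = [
    [("a", 10), ("b", 5)],
    [("c", 7)],
    [("d", 1), ("e", 2)],
]

def server_side_partial_sum(region_rows):
    # In the real system this runs inside the region server (via a
    # coprocessor hook), so raw rows never leave the server.
    return sum(value for _key, value in region_rows)

# The client merges one number per region instead of fetching every row.
partials = [server_side_partial_sum(rows) for rows in regions]
total = sum(partials)
print(total)  # 25
```

For an aggregation query like `SELECT SUM(v) FROM t`, this turns network traffic from "every row" into "one partial result per region", which is why server-side execution matters so much.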

AR: Q5. What are the key factors behind the impressive performance of Phoenix? What makes accessing HBase data with Phoenix faster than the native HBase API?

JT: Key factors behind the performance of Phoenix include:
  • parallelization of queries based on stats; HBase does not know how to chunk queries beyond scanning an entire region
  • pushing processing to the server - most "by hand" API accesses do not use coprocessors. This makes a huge difference for aggregation queries.
  • supporting and using secondary indexes, with different flavors depending on the read/write workload of your application.
  • using "every trick in the book" based on various factors: the HBase version, metadata and query, reverse scans, small scans, skip scans, etc.
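One of those tricks, the skip scan, can be sketched with a toy example. This is plain Python rather than Phoenix internals, and the key data and function are invented for illustration: instead of reading every row, the scan seeks directly to each next possibly-matching key when the query constrains the leading part of the row key to a set of values.

```python
import bisect

# Row keys are stored in sorted order, as in an HBase region.
sorted_keys = ["a1", "a2", "b1", "b3", "c2", "d1", "d9"]
wanted_prefixes = ["a", "c"]  # e.g. WHERE key_prefix IN ('a', 'c')

def skip_scan(keys, prefixes):
    matches, seeks = [], 0
    for prefix in prefixes:
        # Seek straight to the first key >= prefix, skipping everything
        # in between (a full scan would have read those rows too).
        i = bisect.bisect_left(keys, prefix)
        seeks += 1
        while i < len(keys) and keys[i].startswith(prefix):
            matches.append(keys[i])
            i += 1
    return matches, seeks

matches, seeks = skip_scan(sorted_keys, wanted_prefixes)
print(matches)  # ['a1', 'a2', 'c2']
print(seeks)    # 2 seeks instead of scanning all 7 keys
```

On a real table the skipped ranges can span millions of rows, so replacing a full scan with a handful of seeks is a large win.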

AR: Q6. How has the sharp focus on HBase data impacted the progress of Phoenix?

JT: On the one hand, our decision to interoperate with the existing HBase data model makes it easy for users both to get started and to integrate with existing Hadoop-based infrastructure. On the other hand, we're relying on the HBase community to implement more efficient block-encoding schemes for cases where the schema (i.e., the set of column qualifiers) is known in advance. That improvement could dramatically increase scan speed for OLAP applications written on Phoenix.

Second part of the interview