How to Make Your Database 200x Faster Without Having to Pay More
Waiting long for a BI query to execute? I know it’s annoyingly frustrating… It’s a major bottle neck in day-to-day life of a Data Analyst or BI expert. Let’s learn some of the easy to use solutions and a very good explanation of why to use them, along with other advanced technological solutions.
Now the next question is how do you do create these samples? And how do you measure the accuracy? I have written a book chapter about this which you can read and learn how to do this yourself. But there are automated ways of doing this too. There are several products out there which you can use to hide this complexity from you, so you can just press a button and they do it for you A-Z by returning extremely fast answers and some of them even give you knob which you can turn to decide how much accuracy to trade for how much speedup:
- BlinkDB / G-OLA
Although there were many AQP proposals out there, BlinkDB was perhaps the first distributed (massively parallel) AQP engine that became an open-source project. In the spirit of full disclosure, I was involved in this project, and so I may be biased here since I really liked BlinkDB approach and I think it really shaped many of the academic and industrial proposals that came after. The work on BlinkDB was continued by Databricks (the company commercializing Apache Spark). A while ago, Databricks announced an extension to BlinkDB which would incrementally refine its answers until the user is satisfied. This extension was called G-OLA. G-OLA has never been publicly released and BlinkDB has not been updated for a long time now.
SnappyData is an open source in-memory hybrid analytics platform that offers OLTP, OLAP and Streaming all in a single engine. The database engine itself is built by extending Apache Spark (and is hence fully compatible with Spark). The in-memory core also offers AQP through curated stratified sampling and other probabilistic structures. Its query syntax is similar to BlinkDB, which allows users to specify their desired accuracy. This means you can treat accuracy as a dial. For example, you can ask for 100% if you want an exact answer (which is the default behavior), but if you want a quick answer, you can also get a 99% accurate answer within a second or so. In my opinion, a key advantage of SnappyData is that it uses curated stratified samples. What this means is that you can run your analytical queries in a couple of seconds even if you’re querying terabytes of data, running queries on a laptop, or are running in a shared cluster with tens of other concurrently queries. It also has built-in support for streaming, which allows you can built samples and have them updated in real-time in response to incoming streams.
Another nice feature in SnappyData is that it comes with a number of high-level user interfaces, which means you do not need to be proficient in statistics to use its AQP features. For example, they now have a free cloud service, iSight, that uses Apache Zeppelin as a front end to visualize query responses instantly while running the full query in the background.
Disclosure: I am an advisor to SnappyData.
Facebook’s Presto had an experimental feature for approximating basic aggregate queries. I don’t really know if this feature is now in their latest release or not, but the downside was that you would have to use a different syntax (i.e., modify your SQL query) to be able to invoke those approximate aggregate features. This is troublesome if you have existing BI tools or applications that are not using this special syntax, as they won’t be able to benefit from the potential speedup unless they’re re-written to use this new syntax.
InfoBright offers approximate query features (called IAQ). Unlike other systems, IAQ does not use samples at all. Unfortunately there is not much published about the details of how their approximate features work and what accuracy guarantees they offer, but from reading their blogs, I believe they are building models of the underlying data and using those to answer queries instead of using samples. Again, I don’t know much more about IAQ as they’re not open-source and I wasn’t able to find many details on their website, but they sound like an interesting approach.
Analytical Bootstrap System (ABS)was another approximate query enginethat would use samples and an efficient statistical technique for estimating the error. The latest code is a bit outdated and only works for earlier versions of Apache Hive. The project is not currently active.
Verdict is a middleware that sits between your application or BI tool and your backend SQL database. You can simply issue the same queries on the existing database as before and get approximate answers right away. In principle, it is possible to use Verdict with any SQL database, which means you won’t be restricted to any specific DBMS. But currently, it only comes with drivers for Spark SQL, Hive and Impala. The advantage is that it is generic and can be made to work with any SQL database and that is it open-source, and the downside is that since it’s a middle-ware it probably isn’t as efficient as some of the commercial offerings like InfoBright or SnappyData.
Disclosure: I was the designer for Verdict.
- Oracle 12C
Oracle 12C has introduced approximate count distinct and approximate percentile. These approximate aggregations improve performance and use less memory in their computation. Oracle 12C also provides materialized view support so that users can even pre-compute their approximate aggregates. Although approximate percentiles and count distinct queries are quite useful and common in practice, there is no general support for other types of queries. But given Oracle’s last user base, even these limited features will benefit a lot of users. Although, to the best of my knowledge, many other database vendors have long supported approximate count distinct queries (e.g., using HyperLogLog algorithm). Here’s a paper for those of you who are interested in learning more about these new features in Oracle 12C.
Bio: Barzan Mozafari is an Assistant Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Prior to that, he was a Postdoctoral Associate at MIT. He earned his Ph.D. in Computer Science from UCLA in 2011. His research career has led to many successful open-source projects, including CliffGuard (the first robust framework for database tuning), DBSeer (the first automated database diagnosis tool), and BlinkDB (the first massively parallel approximate query engine). He has won the National Science Foundation CAREER award, as well as several best paper awards in ACM SIGMOD and EuroSys. He is also a co-founder of DBSeerand a strategic advisor to SnappyData, a company that commercializes the ideas introduced by BlinkDB.
- 3 Signs That BI Will Never Be The Same Again
- Which Database is best for an Analyst?
- Evaluating HTAP Databases for Machine Learning Applications