How to Make Your Database 200x Faster Without Having to Pay More
Waiting too long for a BI query to finish? I know how frustrating that is: it's a major bottleneck in the day-to-day life of a data analyst or BI expert. Let's go over some easy-to-use solutions, a solid explanation of why they work, and a few more advanced technologies along the way.
Return 99.9%-accurate answers for low-priority queries, and 100%-accurate answers for the rest! Again, this is possible because, according to the laws of statistics, you can usually get a 99.9%-accurate answer by processing only 0.1% of the data. That's why sacrificing 0.1% of accuracy can mean a 100-200x speedup in practice. I know no one wants to be the one settling for 99.9% accuracy, but consider the alternatives: having your query terminated, put on a long waitlist, or sitting idle while you wait for it to finish. And as I mentioned under Scenario 1, in many cases you can still get the full answer eventually and simply use the 99.9%-accurate answer while you wait for it. I'll get back to the "how" question at the end of this post. For now, remember that 99.9% accuracy doesn't mean you miss 0.1% of your output. You usually still see everything, but the actual numbers might be off by 0.1%, which in most cases means you can't tell the difference unless you really squint. Compare these two plots:
These plots show the output of a query asking for the total trip time to downtown, using the famous NYC cab dataset.
Can you tell which one is the 100%-accurate answer and which is the 99.9%-accurate one? To most humans, they look identical. But the top query took only 1.7 seconds, while the bottom one took 42.7 seconds: by sacrificing 0.1% of accuracy, you saved 25x on CPU time! Let's cover one last scenario, and then I promise I'll get to the "how" part.
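If you want to see this effect for yourself, here's a minimal Python sketch of the same idea. Everything in it is made up for illustration (the one-million-row "table" of trip times and their distribution are my own assumptions, not the NYC cab dataset): it computes an average once by scanning every row and once from a 0.1% uniform random sample, then reports the relative error.

```python
import random

random.seed(42)

# Toy stand-in for a large trips table: one million trip times, in minutes.
# (Synthetic data for illustration only.)
trip_minutes = [random.uniform(2.0, 60.0) for _ in range(1_000_000)]

# Exact answer: scan every row.
exact_avg = sum(trip_minutes) / len(trip_minutes)

# Approximate answer: scan only a 0.1% uniform random sample.
sample = random.sample(trip_minutes, k=1_000)
approx_avg = sum(sample) / len(sample)

rel_error = abs(approx_avg - exact_avg) / exact_avg
print(f"exact={exact_avg:.2f}  approx={approx_avg:.2f}  rel_error={rel_error:.4f}")
```

The sample-based average touches 1,000x fewer rows, and its error shrinks roughly with the square root of the sample size, which is why tiny samples can still give very accurate aggregates.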
Scenario 3 (Machine Learning and Data Science). If you're a machine learning guru or a data scientist, you often find yourself training statistical models, tuning parameters, or doing feature selection and engineering. Two of the most common frustrations here are the massive number of parameters and features you need to try out and the long time it takes to train models. The clusters are constantly busy training and testing different models, which limits the set of models and parameters data scientists can try, or at least slows down the process.
In many applications, you can make perfectly reasonable decisions without perfectly accurate answers. A/B testing, root-cause analysis, feature selection, visualization, noisy data, and datasets with missing values are all situations where this is certainly true. About the only time you don't want to approximate is when you're working in a billing department!
I will probably do a separate post on how to speed up parameter tuning or feature selection.
But "how" can we speed up our queries 200x while sacrificing only a tiny bit of accuracy?
The answer lies in a technique called Approximate Query Processing [S1], or AQP for short. There are different ways of doing AQP, but the simplest is to use a random sample of your tables. It turns out that if your data is skewed, you don't want a plain random sample, as it will miss most of the outliers and rare items that might exist in your original table. So what's more common in practice is what's called a "stratified sample." What is a stratified sample? Consider the following table:
Let’s say you want to run the following query on this table:
SELECT avg(salary) FROM table WHERE city = 'Ann Arbor'
Of course you can run this query as-is, but if the table has hundreds of millions of rows, or is partitioned across multiple machines, it could take several minutes to finish. Instead, you might decide to run it only on a random (a.k.a. uniform) sample of the table, say:
As you can see, since Ann Arbor tuples are quite rare compared to NYC tuples in the original table, chances are the sampled table contains only a few of them, or none at all. A stratified sample, instead, first stratifies (i.e., partitions) the table on the City column and then samples the rows of each city at random:
Without getting into a lot of statistics, it turns out that a stratified sample like this ensures very high accuracy while processing only a tiny fraction of the original tuples.
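To make the difference concrete, here's a Python sketch of uniform vs. stratified sampling on a toy version of the table above. The row counts, salary distributions, sampling fraction, and the 50-row minimum per stratum are all my own illustrative assumptions, not part of the original example:

```python
import random
from collections import defaultdict

random.seed(7)

# Toy version of the table above: NYC rows vastly outnumber Ann Arbor rows.
# (Row counts and salary distributions are made up for illustration.)
rows = [("NYC", random.gauss(90_000, 15_000)) for _ in range(100_000)]
rows += [("Ann Arbor", random.gauss(70_000, 10_000)) for _ in range(100)]
random.shuffle(rows)

# Uniform (plain random) sample: 0.5% of all rows, regardless of city.
uniform_sample = random.sample(rows, k=len(rows) // 200)
ann_arbor_uniform = [s for city, s in uniform_sample if city == "Ann Arbor"]

# Stratified sample: partition (stratify) by city first, then sample each
# stratum at random, guaranteeing a minimum row count even for rare cities.
strata = defaultdict(list)
for city, salary in rows:
    strata[city].append(salary)

def stratified_sample(strata, frac=0.005, min_rows=50):
    out = {}
    for city, salaries in strata.items():
        k = min(len(salaries), max(min_rows, int(len(salaries) * frac)))
        out[city] = random.sample(salaries, k)
    return out

samples = stratified_sample(strata)
approx_aa = sum(samples["Ann Arbor"]) / len(samples["Ann Arbor"])

print("Ann Arbor rows in uniform sample:   ", len(ann_arbor_uniform))
print("Ann Arbor rows in stratified sample:", len(samples["Ann Arbor"]))
print(f"avg(salary) for Ann Arbor from the stratified sample: {approx_aa:,.0f}")
```

The uniform sample typically catches zero or one Ann Arbor rows, so the average salary query has almost nothing to work with; the stratified sample guarantees enough rows from every city to answer the query accurately, while still touching only a small fraction of the table.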