How to Make Your Database 200x Faster Without Having to Pay More
Waiting long for a BI query to execute? I know it’s annoyingly frustrating… It’s a major bottle neck in day-to-day life of a Data Analyst or BI expert. Let’s learn some of the easy to use solutions and a very good explanation of why to use them, along with other advanced technological solutions.
By Barzan Mozafari, U. of Michigan.
Almost everyone these days is complaining about performance in one way or another. It’s not uncommon for database administrators and programmers to constantly find themselves in a situation where their servers are maxed out, or their queries are taking forever. This frustration is way too common for all of us. The solutions are varied. The most typical one is squinting at the query and blaming the programmer for not being smarter with their query. Maybe they could have used the right index or materialized view or just re-write their query in a better way. Other times, you might have to spin up a few more nodes if your company is using a cloud service. In other cases, when your servers are overloaded with too many slow queries, you might set different priorities for different queries so that at least the more urgent one (e.g., CEO queries) finish faster. When the DB does not support priority queues, your admin might even cancel your queries to free up some resources for the more urgent queries.
No matter which one of these experiences you’ve had, you’re probably familiar with the pain of having to wait for slow queries or having to pay for more cloud instances or buying faster and bigger servers. Most people are familiar with traditional database tuning and query optimization techniques, which come with their own pros and cons. So we’re not going to talk about those here. Instead, in this post, we’re going to talk about more recent techniques that are far less known to people and in many cases actually lead to much better performance and saving opportunities.
To start, consider these scenarios:
Scenario 1 (Exploratory Analytics). You’re an analyst slicing and dicing your data looking for insight, patterns or testing various hypotheses you have about your business, customers, or services. In these situations, you don’t usually know what you’re exactly going to see. You run a query, look at the results, and then decide if you need to run a different query. In other words, you engage in a series of exploratory and adhoc queries until you find what you want. Only some of the queries that you run are going to be used to populate a company report, fill out a form, or generate a plot for your customer. But each time you submit a query, depending on your data size and the number of other concurrent queries in the cluster, you may have to wait several minutes to get an answer. While waiting, you’re usually idle and may not be able to decide on your next query because it depends on the output of your previous query, which you are still waiting on!
Solution: While waiting, you can immediately see an “almost perfect” answer. What do I mean by an “almost perfect” answer? Compare the two plots below.
These are the outputs of the same BI tool after running a query that loads data from a backend database. The one in the bottom takes 71 minutes to finish as it’s using all 1B data points, whereas the one on the top uses only 1M data points but finishes in only 3 seconds! Of course the plot on the top is slightly blurry compared to the final version in the bottom. But think about it: isn’t it worth it? Instead of waiting 71 mins, you get an immediate answer that is almost identical and then can decide if you want to wait another 71 mins for the full answer or just move on with your life! I can’t imagine anyone not being able to decide whether the full version is worth the wait by looking at the 3-sec version!
Is this is a new idea: certainly not! In fact, all web-browsers already do this. Next time you try to load a high-resolution image on your browser pay attention to how your web browser first tries to load and display a rough image which gradually gets finer and finer. But applying the same principle in a database and SQL query processing is far less known.
So you might have a few questions at this point: how can we achieve such a speedup in practice? Does this work even if your data distribution is skewed? Do you still see the outliers? Do you need to use a specific database to enjoy this type of tradeoff between speed and accuracy? I will hopefully answer all these questions towards the end of this post, but I want to first tell you a few more scenarios in which you might find the same idea appealing: seeing an immediate result that is 99.9% accurate but 200x faster!
Scenario 2 (Overloaded Cluster). Like most database users today, you may not have a dedicated cluster to yourself. In other words, you’re sharing a cluster with your team, or even other reporting and BI tools that fire up SQL queries to the same shared pool of database resources. When these shared clusters are overloaded, only one of three things can happen in practice:
A. Global frustration. You do nothing, and let everyone suffer equally. In other words, once the database queues are backed-up and the CPU cores are maxed out, no one can get their queries finished fast enough.
B. Partial frustration. You can do something smarter by terminating or suspending lower-priority queries to let the more urgent queries (e.g., those going to your manager!) finish first. In other words, you keep a few more important guys happy but make everyone else angry!
C. If these situations happen often enough, you may buy more and bigger servers or migrate to the cloud and spin up more nodes on demand. Of course this option costs $$$ and is inconvenient. Plus, it’s usually not a short-term solution.
One thing that most people don’t realize is that there is a forth option that is better than the first two options and unlike the third option doesn’t cost you money either. So what’s that option? That forth option is the following: