KDnuggets Home » News » 2015 » Dec » Opinions, Interviews, Reports » Which Database is best for an Analyst? ( 15:n41 )

Which Database is best for an Analyst?

Database comparisons usually look at architecture, cost, scalability, and speed, but rarely address the other key factor: how hard is writing queries for these databases? We examine which of the top 8 databases are easiest to use.

By Benn Stancil, Chief Analyst at Mode.

Which database is best? The question, obviously, depends on what you want to use it for.

Best Database Search

I, like most analysts, want to use a database to warehouse, process, and manipulate data—and there’s no shortage of thoughtful commentary outlining the types of databases I should prefer. But these evaluations, which typically discuss databases in terms of architecture, cost, scalability, and speed, rarely address one other key consideration: how hard is it for analysts to write queries against these databases?

Relative to factors like processing speed and scalability, the particulars of a database’s query language may seem trivial. Regardless of how hard it is to get through a race car’s door, won’t it always be faster than a Prius?

For many analysts, however, we aren’t always driving 500 miles as fast as we can—we’re making lots of short trips to the grocery store. When you work with a database day in and day out, the annoyances that hinder every quick project—how do I get the current time in Redshift? NOW()? CURDATE()? CURDATE? SYSDATE? WHATDAYISIT?—often slow you down more than a lower top speed.

With this in mind, I decided to approach the “which database is best?” question from a different angle. I wanted to find the easiest database to query.

I looked into the question as any analyst would—by using data. Analysts write thousands of queries in a number of different languages in Mode every day. Though Mode supports 11 types of databases, my analysis focused on the eight most popular: MySQL, PostgreSQL, Redshift, SQL Server, BigQuery, Vertica, Hive, and Impala. I looked at millions of queries run in Mode’s editor, which excludes all scheduled runs, reports run in lists, and reports run with parameters by people other than the query’s authors.

So which databases, despite how fast they go and how much they cost, have doors that are just too hard to get in and out of?

A basic measure of difficulty

The most basic indicator that an analyst is having trouble with a query is when it fails. These error messages, (constantly) rejecting bad syntax, misnamed functions, or a misplaced comma, probably provide the truest indication of how much a language frustrates an analyst.

I started simple, looking at how often queries fail. As it turns out, Vertica and SQL Server have the highest error rates and MySQL and Impala the lowest. The chart below shows the error rates for each database.

Database Failure Rates

Unfortunately for our wallets (Impala, MySQL, and Hive are all open-source and free, while Vertica, SQL Server, and BigQuery are decidedly not), rates like these are probably too crude to be conclusive.

People use databases for different things. Vertica and SQL Server are proprietary databases provided by major vendors, and most likely used by large businesses with deeper analytical budgets. The high error rates from these languages may come from a more ambitious use of the language rather than the language being “harder.”

Controlling for query complexity

Can we then adjust for how complex a query is? Unfortunately, controlling for query complexity is hard.

Query length could be a decent proxy, but it’s not perfect.

Median Query Length

An easy language may be easy because it’s concise. Or, as anyone who’s attempted to parse a string of seemingly random brackets, backslashes, and periods in a regular expression will tell you, a language may be hard because it’s concise.

While there are clear differences in query lengths across different languages, the relationships between query length, query complexity, and language difficulty are all intertwined. Figuring out these relationships sounds even more daunting than parsing regex.

But we may be able to control for complexity in other ways. Queries often evolve over the course of an analysis. They start as simple explorations and become more complex as analysts add layers. You can see this evolution in the chart below, which shows how the median query doubles in length after 20 or so edits, and triples after 100 edits.

Query Length Relative to Initial Length

Rather than comparing queries of similar lengths, we could instead compare queries at the same stage in the analytical process. How often does the first query run result in an error? The fifth? The 20th?

The chart below shows the error rates for queries by the number of times analysts have edited them. After five or so runs, a few clear patterns emerge. PostgreSQL, MySQL, and Redshift have consistently low error rates. Impala, BigQuery, and SQL Server have high error rates. And as before, Vertica consistently outpaces the rest with the highest error rate.

Database Failure Rate by Edit Number

This suggests that more traditional versions of SQL—PostgreSQL and MySQL—are the easiest SQL languages to use. Among analytical databases, Redshift takes a clear lead over languages like Vertica and SQL Server.