Which Database is best for an Analyst?
Tags: Amazon Redshift, Apache Hive, BigQuery, Data Analyst, Databases, MySQL, NoSQL, SQL Server, Vertica
Database comparisons usually look at architecture, cost, scalability, and speed, but rarely address the other key factor: how hard is writing queries for these databases? We examine which of the top 8 databases are easiest to use.
Query complexity, however, isn’t the only factor affecting error rates. If you’ve seen me bulldoze through 15 syntax errors in a row, you know the skill of the analyst also matters.
Twenty percent of analysts using Mode write queries against more than one type of database. Personally, I regularly use PostgreSQL and Redshift, and sometimes MySQL and BigQuery.
These multi-lingual analysts offer us an opportunity. Among people who use different languages, which are they most comfortable with? Does an analyst who uses PostgreSQL and BigQuery tend to have higher error rates in one language or another? If we could pit SQL languages against each other (in what would surely be the nerdiest round robin tournament ever), which one would win?
I used a method of pairwise comparisons to aggregate together these head-to-head matchups:
- I found all the analysts who’ve run a minimum of 10 queries per database for multiple databases.
- I calculated the query error rates for each analyst for each database.
- I averaged the differences in error rates for every database pair to construct the matrix below.
The matrix shows the difference in error rates of the database on the top row compared to the database on the left. Here, a higher number is worse than a lower number. For example, the “20.2” at the intersection of Hive and BigQuery indicates that, among analysts who use both of those databases, the error rate tends to be 20.2% higher for Hive than BigQuery.
The total score line on the bottom sums the differences for each database. The result provides a similar conclusion to the error-by-run analysis: MySQL and PostgreSQL are the easiest versions of SQL to write. Redshift also jumps up a couple spots, from the fourth easiest to the second easiest.
Vertica gains the most ground. It moves from being the most difficult language to somewhere near the middle of the pack, beating out SQL Server and Hive. This suggests that Vertica’s high error rate may be more indicative of the type of analyst that uses it than it is of the language itself.
Overall, these numbers point to MySQL and PostgreSQL as the easiest versions of SQL to write. Intuitively, this makes sense. Among the eight languages analyzed, these two are the most widely used and taught (Mode’s SQL School, for instance, uses PostgreSQL). Unfortunately for analysts, they’re also poorer in features—and often slower—than languages like Vertica and SQL Server.
For analysts looking for ease of use without sacrificing too much speed—driving to the grocery store is faster than walking, after all—Redshift is the clear winner. Add the collective vote of analysts using Mode to the growing pile of recommendations.