7 Steps to Mastering SQL for Data Science
Follow these 7 steps to go from SQL data science newbie to seasoned practitioner quickly. No nonsense, just the necessities.
Structured Query Language (forever and always referred to henceforth as SQL, and usually pronounced sequel) is the language for managing and querying data in relational database management systems (RDBMS).
So intertwined are the terms SQL and RDBMS that they are often conflated, sometimes by the uninitiated, but often simply out of convenience, and the term SQL is used adversarially to distignuish relational systems from non-relational database systems, which are often categorized by the umbrella term "NoSQL." Yet, SQL skills aren't wasted on non-RDBMS systems; the top data processing frameworks all have some implementation of SQL that sits atop their architecture, including Apache Big Data behemoths Spark and Hadoop.
Structured Query Language is usually pronounced sequel.
Together with R and Python, SQL forms the triumvirate of data science programming languages (liberal usage of the term programming languages); in fact, the most recent KDnuggets data science and analytics software poll results indicate that these are also the 3 most-used tools by respondents, when all software is considered.
Clearly, SQL is important in data science. As such, this post aims to take a reader from SQL newbie to competent practitioner in a short time, using freely-available online resources. Lots of such resources exist on the internet, but mapping out a path from start to finish, using items which complement each other, is not always as straightforward as it may seem. Hopefully this post can be of assistance in this manner.
Step 1: Relational Database Basics
First off, since SQL is used to manage and query data in relational databases, having an understanding of what, exactly, a relational database is would be a good start.
Jennifer Widom is a renowned database researcher and professor at Stanford. She also put together one of the very first MOOCs on Coursera. Videos from this course are a good place to start:
A good introductory read on the relational database model is presented by UMass Boston's Dan Simovici here:
Extra credit: For a better theoretical understanding of the foundations upon which relational database systems are built, have a look at Widom's videos on relational algebra:
Step 2: SQL Overview
Now that we're all familiar with the relational paradigm, we shift focus to SQL.
First, get a high-level overview, again from Widom:
Read the first few pages of Simovici's SQL notes, or until it stops making intuitive sense:
For a great extended introductory overview of basic SQL commands, watch the following video from Khan Academy:
Once finished with the video, the following is a good treatment of SQL commands, with explanations, and can also serve as a useful reference moving forward, so keep it handy:
Finally, in preparation of the of next step, get an SQL environment up and running. You may not want to enter every SQL statement you encounter, but having an SQL interpreter up and running just makes sense. I suggest installing SQLite locally; it is a simple, but capable, SQL installation:
Constructing SQL statements.
Step 3: Selecting, Inserting, Updating
While SQL can be used to perform a number of varied data management tasks, querying databases with SELECT, inserting records with INSERT, and updating existing records with UPDATE are some of the most heavily-used commands, and are good places to get started in practical SQL. Read and go through the following preliminary exercises:
This tutorial covers SQL basics and beginner to intermediate queries, and is a solid review of what we have seen to this point, and then some. It is presented from a MySQL point of view, but nothing in here is exclusive to that platform's SQL implementation.