7 Steps to Mastering SQL for Data Science — 2019 Edition
Follow these updated 7 steps to go from SQL data science newbie to practitioner in a hurry. We consider only the necessary concepts and skills, and provide quality resources for each.
What is Structured Query Language (SQL) and why is it so integral to data management?
Some time ago I wrote 7 Steps to Mastering SQL for Data Science, a post which attempted to aggregate and organize some of the available, free quality material into a short but helpful crash course. However, the included material is now getting stale and many of the links no longer work, which makes sense considering that the original has been around for a few years at this point. As folks have recently been asking for an updated version, I thought it would be a good time to revisit the concept and put together a new learning path for mastering SQL for data science.
This time around, we aim to, once again, lay out the path to SQL mastery. Let's make sure we view 'mastery' in a relative sense, however, and not expect to be scripting at database guru-levels after getting through the materials. The learning path is aimed at those with some understanding of databases, programming, computer science concepts, and/or data management in an abstract sense, who are wanting to be able to use SQL to manage and query their own data and scale up to larger systems.
Instead of having a high number of resources for each topic step (say, Views & Joins), I have tried to select a quality reading resource or two, along with an accessible video covering the given topic.
So grab a cup of your favorite beverage and settle in, as we start mastering SQL for data science in these 7 steps.
Step 1: Relational Database Basics
Let's start by revisiting the question posed at the start of this article: What is Structured Query Language (SQL) and why is it so integral to data management?
Very simply, SQL is the language for managing and querying data in relational database management systems (RDBMS).
So intertwined are the terms SQL and RDBMS that they are often conflated, sometimes by the uninitiated, but often simply out of convenience, and the term SQL is used adversarially to distinguish relational systems from non-relational database systems, which are often categorized by the umbrella term "NoSQL." Yet, SQL skills aren't wasted on non-RDBMS systems; the top data processing frameworks all have some implementation of SQL that sits atop their architecture.
Let's first watch and listen to Prof. Andy Pavlo of Carnegie Mellon University discuss databases, database systems, the relational model, relational algebra, and related concepts. This is a great starting point.
Step 2: SQL Overview
Now that we're all familiar with the relational paradigm, we shift focus to SQL.
Let's approach this initial exposure to SQL as follows. First watch the video below, titled "SQL - A Brief Review," by Charles Germany, for some very quick insight into where SQL came from, along with some syntax by way of a few short examples. As with the resources and examples below the video, don't worry about fully understanding the syntax, just get a feel for what SQL is doing, what it can be used for, and what it looks like.
Next, have a look at these 2 short tutorials: SQL Basic Queries (via Data Carpentry) does what its title suggests, while List of SQL Commands (via Codecademy) provides a more broad overview of SQL commands and usage. This second resource in particular should make for a handy reference later on. Some of these commands are covered in greater detail in subsequent steps, so, again, don't worry about understanding everything. At the same time, these initial examples are rather intuitive, and so should be helpful for your understanding.
Finally, in preparation of the of next step, get an SQL environment up and running. You may not want to enter every SQL statement you encounter, but having an SQL interpreter up and running just makes sense. I suggest installing SQLite locally; it is a simple, but capable, SQL installation.
Step 3: Selecting, Inserting, Updating
While SQL can be used to perform a number of varied data management tasks, querying databases with SELECT, inserting records with INSERT, and updating existing records with UPDATE are some of the most heavily-used commands, and are good places to get started in practical SQL.
Read and go through the following preliminary exercises, all courtesy of SQL Course.
This tutorial, MySQL Tutorial 1: Overview, Tables, Queries from Arachnoid, covers SQL basics and beginner to intermediate queries, and is a solid review of what we have seen to this point, and then some. It is presented from a MySQL point of view, but nothing in here is exclusive to that platform's SQL implementation.
Step 4: Creating, Dropping, Deleting
Our second set of commands include those used to CREATE and DROP tables, as well as to DELETE records. With an understanding of this growing collection of commands, suddenly much of what could be referred to as regular data management and query is attainable (with practice, of course).
Try out the following tutorials again via SQL Course.
Step 5: Views and Joins
On to some slightly more advanced SQL topics.
First, we have a look at views, which can be thoughts of as virtual tables populated by the results of queries, useful for a number of different scenarios including application development, data security, and eased data sharing. Watch this video from Socratica for insight.
Joins come in different flavors, and likely one of the more complex topics you will cover while learning SQL is getting them straight. That's really more of a testament to the ease of SQL than the actual difficulty of learning about joins. Get a better understanding of joins with this Socratica video.
This tutorial, MySQL Tutorial 2: Views and Joins from Arachnoid, covers vies and joins. It is also presented from a MySQL point of view, but nothing in here is exclusive to that platform's SQL implementation.
Step 6: Advanced SQL
Advanced SQL can mean a lot of different things, and so we will narrow our scope here.
First watch this lecture from Prof. Andy Pavlo of Carnegie Mellon University, wherein he discusses aggregates, distinct aggregates, group by, having, string operations, date and time operations, nested queries, table expressions, and more.
Step 7: Query Optimization
Once you know how to write queries, you are going to want to learn how to optimize them for both results and run time. This is especially true of complex queries on large databases.
Watch Prof. Andy Pavlo of Carnegie Mellon University cover this topic in the following lecture video.
Hopefully you have learned enough SQL at this point to consider yourself a "master" of the subject as relates to data science. But don't stop here; there are plenty more free quality resources online which you can use to build on this new foundation of mastery. You can never know enough SQL, and as with so many other skills, practice in the key to reinforcement.
- 7 Steps to Mastering Basic Machine Learning with Python — 2019 Edition
- SQL Case Study: Helping a Startup CEO Manage His Data
- To SQL or not To SQL: that is the question!