KDnuggets Top Blog Winner

7 Steps to Mastering SQL for Data Science

SQL is a must-know for anyone working in the data industry. Here’s how you can learn it from scratch.


Why Should You Learn SQL To Become A Data Scientist?


Knowledge of SQL is a prerequisite for the majority of open data science positions. In fact, according to this 2021 analysis, SQL was the most in-demand technical skill for data jobs, followed by Python and machine learning.

Yet data science courses and boot camps don’t emphasize teaching students to deal with large amounts of data. Most data science learning material focuses on predictive modeling, and candidates who complete these programs are left without the ability to query and manipulate databases.

When I started my first data science internship, I was excited to start building machine learning algorithms on my first day. However, the task I was given was completely different from what I’d expected. I had to query data from a database, clean it, and perform an analysis to answer a business question.

After my first week at the company, I realized that many business use-cases in organizations didn’t even require predictive modeling to solve. Often, a simple SQL query sufficed to filter and aggregate data according to the stakeholder’s requirement.

Initially, I couldn’t perform these tasks as quickly as my co-workers, and I spent more time on data extraction and pre-processing because I was unfamiliar with SQL. Luckily, as this was just an internship, expectations weren’t too high, and I was able to improve my querying skills as I went on.

The biggest piece of advice I can give aspiring data scientists is to learn SQL. Most data science learning providers overlook this skill, but it is arguably as important as machine learning modeling.

Even for tasks that require you to build a fancy predictive algorithm, knowledge of SQL is a must. Most organizations store their data in relational databases, and you need to pull data from these databases and pre-process it before you can even begin to build ML models.

If you lack knowledge of SQL, you will spend a lot more time than expected on data preparation and analysis even if you are an expert at machine learning.

In this article, I will walk you through 7 steps you can take to master SQL for any data science or analytics role.


How To Learn SQL For Data Science


Step 1: SQL Basics


As a data scientist, you will be reading from databases and analyzing data to fit your use-case. You generally don’t need to create databases or modify existing ones; most companies have a separate team to do this.

If you have no prior SQL knowledge whatsoever, start with this tutorial to understand what an RDBMS is.

Then, watch this YouTube video by Lucidchart to learn to create and read ER Diagrams. An ERD is a structural diagram used to visualize the tables in a database and the relationships between them. As a data scientist, when extracting data from different tables, you will often need to refer to an ER Diagram to understand how the tables relate to each other.

After that, you can immediately start learning how to query data in SQL. I highly recommend following along with these tutorials by W3Schools to learn the following keywords: SELECT, WHERE, AND, OR, NOT, IN, BETWEEN, LIKE.

These are some of the simplest SQL keywords used to query and filter database tables. Once you’re familiar with them, start learning CASE expressions. They’re very similar to if-else statements in any programming language.
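As a rough illustration of these keywords, here is a minimal sketch that runs SQL through Python’s built-in sqlite3 module. The `customers` table, its columns, and the age bands are all made up for this example, so treat it as a starting point rather than a reference.

```python
import sqlite3

# Hypothetical in-memory table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Alice", "US", 34),
        ("Bob", "UK", 19),
        ("Chen", "SG", 52),
        ("Dana", "US", 27),
        ("Anil", "IN", 41),
    ],
)

# WHERE filters rows; IN, BETWEEN, LIKE, AND, OR and NOT build the condition.
filtered = conn.execute(
    """
    SELECT name, country, age
    FROM customers
    WHERE (country IN ('US', 'UK') AND age BETWEEN 20 AND 40)
       OR (name LIKE 'A%' AND NOT country = 'US')
    """
).fetchall()

# CASE works like if / else-if / else and returns one value per row.
labelled = conn.execute(
    """
    SELECT name,
           CASE
               WHEN age < 25 THEN 'young'
               WHEN age < 45 THEN 'mid'
               ELSE 'senior'
           END AS age_band
    FROM customers
    """
).fetchall()

print(filtered)
print(labelled)
```

The parentheses in the WHERE clause are optional here because AND binds more tightly than OR, but spelling them out makes the intent easier to read.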


Step 2: Aggregations


SQL aggregate functions perform a calculation over multiple rows and return a single result. The five most commonly used aggregate functions are SUM, COUNT, AVG, MIN, and MAX.
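To see these functions in action, the sketch below computes all five over a small, made-up `orders` table using Python’s built-in sqlite3 module. The table and its values are assumptions for illustration only.

```python
import sqlite3

# Hypothetical orders table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.0), ("alice", 15.0), ("chen", 50.0)],
)

# Each aggregate collapses every row of the table into a single value.
total, n_orders, average, smallest, largest = conn.execute(
    """
    SELECT SUM(amount), COUNT(*), AVG(amount), MIN(amount), MAX(amount)
    FROM orders
    """
).fetchone()

print(total, n_orders, average, smallest, largest)
```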


Step 3: Grouping and Sorting


Next, learn about the GROUP BY and ORDER BY clauses. These are especially useful when you need to aggregate your data by group or sort rows in a specific order.

It is also useful to learn the HAVING clause, which filters groups after aggregation and is frequently used together with GROUP BY.
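Here is a small sketch of grouping, filtering groups, and sorting in one query, again using Python’s sqlite3 module. The `orders` table and the spending threshold of 36 are hypothetical.

```python
import sqlite3

# Hypothetical orders table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", 20),
        ("bob", 35),
        ("alice", 15),
        ("chen", 50),
        ("bob", 5),
        ("chen", 50),
    ],
)

# GROUP BY builds one row per customer, HAVING keeps only the groups whose
# total exceeds the threshold, and ORDER BY sorts the remaining groups.
per_customer = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 36
    ORDER BY total DESC
    """
).fetchall()

print(per_customer)
```

Note the difference from WHERE: WHERE filters individual rows before grouping, while HAVING filters whole groups after the aggregates are computed.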

Step 4: Joins


All the queries above extract data from a single table. If you’d like to combine data from multiple tables, you need to learn the JOIN clause.

Here is a visual representation of SQL joins:

Figure: visual representation of SQL joins (image by CodeProject).


Edureka released a free video on YouTube titled SQL Joins Tutorial For Beginners that you can follow along with. You can also choose to code along to this W3Schools tutorial on different joins. It is also useful to learn the SQL UNION operator once you’re done.
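The sketch below contrasts an inner join with a left join and then shows UNION, using Python’s sqlite3 module. The `customers`, `orders`, and `leads` tables are hypothetical and exist only to make the example runnable.

```python
import sqlite3

# Hypothetical tables, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    CREATE TABLE leads (name TEXT);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Chen');
    INSERT INTO orders VALUES (10, 1, 20), (11, 1, 15), (12, 2, 35);
    INSERT INTO leads VALUES ('Bob'), ('Dana');
    """
)

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
    """
).fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL.
left_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
    """
).fetchall()

# UNION stacks two result sets vertically and removes duplicate rows.
all_names = conn.execute(
    """
    SELECT name FROM customers
    UNION
    SELECT name FROM leads
    ORDER BY name
    """
).fetchall()

print(inner_rows)
print(left_rows)
print(all_names)
```

Chen has no orders, so Chen appears only in the left join result, with None (SQL NULL) as the amount.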


Step 5: Subqueries


Subqueries are also called nested queries in SQL, and are used when the result you want requires more than one query. In a nested query, the result of the inner query is used as input to the outer (main) query.

This might seem confusing at first, but is actually a fairly intuitive concept once you get used to it. 

If you’d like to learn to use subqueries in SQL, read this article by w3resource.
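As a minimal sketch of the idea, the query below uses an inner query to compute the average order amount and an outer query to keep only the orders above it. It runs through Python’s sqlite3 module, and the `orders` table is hypothetical.

```python
import sqlite3

# Hypothetical orders table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20), ("bob", 35), ("alice", 15), ("chen", 50)],
)

# The inner SELECT runs first and yields a single number (the average);
# the outer SELECT then compares each row's amount against that number.
above_average = conn.execute(
    """
    SELECT customer, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY amount
    """
).fetchall()

print(above_average)
```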


Step 6: SQL to Solve Business Problems


As a data scientist, the value you bring to an organization lies in your ability to use data to solve a business problem. When given a use-case by a stakeholder, you need to be able to translate this requirement into a technical analysis.

For example, suppose your manager asks for a list of customers to target in different industries, based on their online browsing behaviour. As a data scientist, you will need to break this task down into the following steps:

  1. Look at the websites these customers visited, and segment the customers by industry based on those visits. This can be done with some basic filtering and grouping in SQL.
  2. Then, look at the recency and frequency of website visits to identify high-potential customers to target in each industry. This might require some additional data pre-processing, filtering, and possibly ranking.
  3. Finally, hand over the filtered output, grouped by industry, to your manager. If you’d like to enrich these customer categories, you can even build a clustering algorithm on top of this data to identify high-potential individuals.

The example above is simple, but captures the thought process of a data scientist when provided with a business problem statement. This is a skill that is developed over time, with practice.
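The first two steps above can be sketched in a single query. Everything here is an assumption made for illustration: the `visits` table, its columns, and the thresholds for what counts as a high-potential customer (at least two visits, with the latest on or after a cutoff date). A real analysis would depend on the data your company actually collects.

```python
import sqlite3

# Hypothetical website-visit log, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (customer TEXT, industry TEXT, visit_date TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("alice", "travel", "2023-01-05"),
        ("alice", "travel", "2023-03-01"),
        ("alice", "finance", "2022-11-20"),
        ("bob", "finance", "2023-02-10"),
        ("bob", "finance", "2023-02-25"),
        ("bob", "finance", "2023-03-03"),
        ("chen", "travel", "2022-08-15"),
        ("chen", "travel", "2022-09-01"),
    ],
)

# Step 1: group visits by industry and customer.
# Step 2: keep groups that are frequent (COUNT) and recent (MAX of the date).
# ISO-formatted date strings sort correctly when compared as text.
targets = conn.execute(
    """
    SELECT industry, customer, COUNT(*) AS visits, MAX(visit_date) AS last_visit
    FROM visits
    GROUP BY industry, customer
    HAVING COUNT(*) >= :min_visits AND MAX(visit_date) >= :cutoff
    ORDER BY industry, visits DESC
    """,
    {"min_visits": 2, "cutoff": "2023-01-01"},  # assumed thresholds
).fetchall()

print(targets)
```

Passing the thresholds as named parameters keeps the business rules separate from the query, which makes them easy to adjust when the stakeholder changes the requirement.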

Udemy has a course on SQL Business Intelligence designed to help students use SQL to support better decision making. The first part of this program covers the fundamentals of SQL (joins, operators, subqueries, aggregations, etc.), and the second half is focused on applying the knowledge learnt to solve business problems.


Step 7: Window Functions


Window functions are a slightly more advanced SQL topic. They let you perform calculations across partitions of a result set while still returning every individual row, unlike GROUP BY, which collapses each group into one row. To learn SQL window functions, follow along with this YouTube video.
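Here is a minimal sketch of two common window functions, a running total and a rank within each customer, using Python’s sqlite3 module. The `orders` table is hypothetical, and the example assumes your Python build bundles SQLite 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Window functions need SQLite 3.25+ (check sqlite3.sqlite_version if unsure).
# Hypothetical orders table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "2023-01-01", 20),
        ("alice", "2023-01-05", 15),
        ("bob", "2023-01-02", 35),
        ("bob", "2023-01-09", 5),
    ],
)

# PARTITION BY splits the rows per customer; every input row is kept.
# SUM(...) OVER with ORDER BY gives a running total inside each partition.
# RANK() OVER orders each customer's purchases from largest to smallest.
windowed = conn.execute(
    """
    SELECT customer,
           order_date,
           amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY order_date) AS running_total,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS amount_rank
    FROM orders
    ORDER BY customer, order_date
    """
).fetchall()

for row in windowed:
    print(row)
```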


Practice, Practice, Practice


Learning all the concepts listed above will help you build a strong foundation in SQL. However, in order to tackle real-world use-cases, you need to practice a lot.

HackerRank and PGExercises are two platforms that can help you do this. They have a series of SQL problems that range from beginner to advanced, and solving these questions will give you a much better grasp of the language.

Sites like HackerRank are often used by hiring managers to assess a candidate’s proficiency in different programming languages, and solving their SQL problems will increase your chances of acing data science interviews.

Natassha Selvaraj is a self-taught data scientist with a passion for writing. You can connect with her on LinkedIn.