KDnuggets Top Blog Winner

The 5 Hardest Things to Do in SQL

The 5 hardest things Josh Berry, a 15 year analytics professional, experienced while switching from Python to SQL. Offering examples, SQL code, and a resource to customize the SQL to your own project.



The 5 Hardest Things to do in SQL
Sql vector created by freepik - www.freepik.com

 

Many of us have experienced the core power of speed and efficiency delivered by centralizing compute within the Cloud Data Warehouse. While this is true, many of us have also realized that, like with anything, this value comes with its own set of downsides. 

One of the primary drawbacks of this approach is that you must learn and execute queries in different languages, specifically SQL. While writing SQL is faster and less expensive than standing up a secondary infrastructure to run python (on your laptop or in-office servers), it comes with many different complexities depending on what information the data analyst wants to extract from the cloud warehouse. The switch over to cloud data warehouses increases the utility of complex SQL versus python. Having been through this experience myself, I decided to record the specific transformations that are the most painful to learn and perform in SQL and provide the actual SQL needed to alleviate some of this pain for my readers. 

To aid in your workflow, you’ll notice that I provide examples of the data structure before and after the transform is executed, so you can follow along and validate your work. I have also provided the actual SQL needed to perform each of the 5 hardest transformations. You’ll need new SQL to perform the transformation across multiple projects as your data changes. We’ve provided links to dynamic SQL for each transformation so you can continue to capture the SQL needed for your analysis on an as needed basis! 

 

Date Spines

 

It is not clear where the term date spine originated, but even those who don’t know the term are probably familiar with what it is.

Imagine you are analyzing your daily sales data, and it looks like this:

sales_date product sales
2022-04-14 A 46
2022-04-14 B 409
2022-04-15 A 17
2022-04-15 B 480
2022-04-18 A 65
2022-04-19 A 45
2022-04-19 B 411

 

No sales happened on the 16th and 17th, so the rows are completely missing. If we were trying to calculate average daily sales, or build a time series forecast model, this format would be a major problem. What we need to do is insert rows for the missing days.

Here is the basic concept:

  1. Generate or select unique dates
  2. Generate or select unique products
  3. Cross Join (cartesian product) all combinations of 1&2
  4. Outer Join #3 to your original data

Customizable SQL for Datespine

The end result will look like this:

sales_date product sales
2022-04-14 A 46
2022-04-14 B 409
2022-04-15 A 17
2022-04-15 B 480
2022-04-16 A 0
2022-04-16 B 0
2022-04-17 A 0
2022-04-17 B 0
2022-04-18 A 65
2022-04-18 B 0
2022-04-19 A 45
2022-04-19 B 411

 

Pivot / Unpivot

 

Sometimes, when doing an analysis, you want to restructure the table. For instance, we might have a list of students, subjects, and grades, but we want to break out subjects into each column. We all know and love Excel because of its pivot tables. But have you ever tried to do it in SQL? Not only does every database have annoying differences in how PIVOT is supported, but the syntax is unintuitive and easily forgettable. 

Before:

Student Subject Grade
Jared Mathematics 61
Jared Geography 94
Jared Phys Ed 98
Patrick Mathematics 99
Patrick Geography 93
Patrick Phys Ed 4

 

Customizable SQL for Pivot

Result:

Student Mathematics Geography Phys Ed
Jared 61 94 98
Patrick 99 93 4

 

One-hot Encoding

 

This one isn’t necessarily difficult but is time-consuming. Most data scientists don’t consider doing one-hot-encoding in SQL. Although the syntax is simple, they would rather transfer the data out of the data warehouse than the tedious task of writing a 26-line CASE statement. We don’t blame them! 

However, we recommend taking advantage of your data warehouse and its processing power. Here is an example using STATE as a column to one-hot-encode.

Before:

Babyname State Qty
Alice AL 156
Alice AK 146
Alice PA 654
Zelda NY 417
Zelda AL 261
Zelda CO 321

 

Customizable SQL for One-Hot Encode

Result:

Babyname State State_AL State_AK State_CO Qty
Alice AL 1 0 0 156
Alice AK 0 1 0 146
Alice PA 0 0 0 654
Zelda NY 0 0 0 417
Zelda AL 1 0 0 261
Zelda CO 0 0 1 321

 

Market Basket Analysis

 

When doing a market basket analysis or mining for association rules, the first step is often formatting the data to aggregate each transaction into a single record. This can be challenging for your laptop, but your data warehouse is designed to crunch this data efficiently.

Typical transaction data:

SALESORDERNUMBER CUSTOMERKEY ENGLISHPRODUCTNAME LISTPRICE WEIGHT ORDERDATE
SO51247 11249 Mountain-200 Black 2294.99 23.77 1/1/2013
SO51247 11249 Water Bottle - 30 oz. 4.99 1/1/2013
SO51247 11249 Mountain Bottle Cage 9.99 1/1/2013
SO51246 25625 Sport-100 Helmet 34.99 12/31/2012
SO51246 25625 Water Bottle - 30 oz. 4.99 12/31/2012
SO51246 25625 Road Bottle Cage 8.99 12/31/2012
SO51246 25625 Touring-1000 Blue 2384.07 25.42 12/31/2012

 

Customizable SQL for Market Basket

Result:

NUMTRANSACTIONS ENGLISHPRODUCTNAME_LISTAGG
207 Mountain Bottle Cage, Water Bottle - 30 oz.
200 Mountain Tire Tube, Patch Kit/8 Patches
142 LL Road Tire, Patch Kit/8 Patches
137 Patch Kit/8 Patches, Road Tire Tube
135 Patch Kit/8 Patches, Touring Tire Tube
132 HL Mountain Tire, Mountain Tire Tube, Patch Kit/8 Patches

 

Time-Series Aggregations

 

Time series aggregations are not only used by data scientists but they’re used for analytics as well. What makes them difficult is that window functions require the data to be formatted correctly.

For example, if you want to calculate the average sales amount in the past 14 days, window functions require you to have all sales data broken up into one row per day. Unfortunately, anyone who has worked with sales data before knows that it is usually stored at the transaction level. This is where time-series aggregation comes in handy. You can create aggregated, historical metrics without reformatting the entire dataset. It also comes in handy if we want to add multiple metrics at one time:

  • Average sales in the past 14 days
  • Biggest purchase in last 6 months
  • Count Distinct product types in last 90 days

If you wanted to use window functions, each metric would need to be built independently with several steps.

A better way to handle this, is to use common table expressions (CTEs) to define each of the historical windows, pre-aggregated.

For example:

Transaction ID Customer ID Product Type Purchase Amt Transaction Date
65432 101 Grocery 101.14 2022-03-01
65493 101 Grocery 98.45 2022-04-30
65494 101 Automotive 239.98 2022-05-01
66789 101 Grocery 86.55 2022-05-22
66981 101 Pharmacy 14 2022-06-15
67145 101 Grocery 93.12 2022-06-22

 

Customizable SQL for Time Series Aggregate SQL

Result:

Transaction ID Customer ID Product Type Purchase Amt Transaction Date Avg Sales Past 14 Days Max Purchase Past 6 months Count Distinct Product Type last 90 days
65432 101 Grocery 101.14 2022-03-01 101.14 101.14 1
65493 101 Grocery 98.45 2022-04-30 98.45 101.14 2
65494 101 Automotive 239.98 2022-05-01 169.21 239.98 2
66789 101 Grocery 86.55 2022-05-22 86.55 239.98 2
66981 101 Pharmacy 14 2022-06-15 14 239.98 3
67145 101 Grocery 93.12 2022-06-22 53.56 239.98 3

 

Conclusion

 

I hope this piece helps shed some light on the different troubles that a data practitioner will encounter when operating within the modern data stack. SQL is a double-edged sword when it comes to querying the cloud warehouse. While centralizing the compute in the cloud data warehouse increases speed, it sometimes requires some extra SQL skills. I hope that this piece has helped answer questions and provides the syntax and background needed to tackle these problems. 

 
 
Josh Berry (@Twitter) leads Customer Facing Data Science at Rasgo and has been in the data and analytics profession since 2008. Josh spent 10 years at Comcast where he built the data science team and was a key owner of the internally developed Comcast feature store - one of the first feature stores to hit the market. Following Comcast, Josh was a critical leader in building out Customer Facing Data Science at DataRobot. In his spare time Josh performs complex analysis on interesting topics such as baseball, F1 racing, housing market predictions, and more.