Solving 5 Complex SQL Problems: Tricky Queries Explained

The 5 hardest things Josh Berry, a 15 year analytics professional, experienced while switching from Python to SQL. Offering examples, SQL code, and a resource to customize the SQL to your own project.

By Josh Berry, Leads Customer Facing Data Science at Rasgo on May 31, 2023 in SQL

Image by Editor

Many of us have experienced the core power of speed and efficiency delivered by centralizing compute within the Cloud Data Warehouse. While this is true, many of us have also realized that, like with anything, this value comes with its own set of downsides.

One of the primary drawbacks of this approach is that you must learn and execute queries in different languages, specifically SQL. While writing SQL is faster and less expensive than standing up a secondary infrastructure to run python (on your laptop or in-office servers), it comes with many different complexities depending on what information the data analyst wants to extract from the cloud warehouse. The switch over to cloud data warehouses increases the utility of complex SQL versus python. Having been through this experience myself, I decided to record the specific transformations that are the most painful to learn and perform in SQL and provide the actual SQL needed to alleviate some of this pain for my readers.

To aid in your workflow, you’ll notice that I provide examples of the data structure before and after the transform is executed, so you can follow along and validate your work. I have also provided the actual SQL needed to perform each of the 5 hardest transformations. You’ll need new SQL to perform the transformation across multiple projects as your data changes. We’ve provided links to dynamic SQL for each transformation so you can continue to capture the SQL needed for your analysis on an as needed basis!

Date Spines

It is not clear where the term date spine originated, but even those who don’t know the term are probably familiar with what it is.

Imagine you are analyzing your daily sales data, and it looks like this:

sales_date	product	sales
2022-04-14	A	46
2022-04-14	B	409
2022-04-15	A	17
2022-04-15	B	480
2022-04-18	A	65
2022-04-19	A	45
2022-04-19	B	411

No sales happened on the 16th and 17th, so the rows are completely missing. If we were trying to calculate average daily sales, or build a time series forecast model, this format would be a major problem. What we need to do is insert rows for the missing days.

Here is the basic concept:

Generate or select unique dates
Generate or select unique products
Cross Join (cartesian product) all combinations of 1&2
Outer Join #3 to your original data

Customizable SQL for Datespine

The end result will look like this:

sales_date	product	sales
2022-04-14	A	46
2022-04-14	B	409
2022-04-15	A	17
2022-04-15	B	480
2022-04-16	A	0
2022-04-16	B	0
2022-04-17	A	0
2022-04-17	B	0
2022-04-18	A	65
2022-04-18	B	0
2022-04-19	A	45
2022-04-19	B	411

Pivot / Unpivot

Sometimes, when doing an analysis, you want to restructure the table. For instance, we might have a list of students, subjects, and grades, but we want to break out subjects into each column. We all know and love Excel because of its pivot tables. But have you ever tried to do it in SQL? Not only does every database have annoying differences in how PIVOT is supported, but the syntax is unintuitive and easily forgettable.

Before:

Student	Subject	Grade
Jared	Mathematics	61
Jared	Geography	94
Jared	Phys Ed	98
Patrick	Mathematics	99
Patrick	Geography	93
Patrick	Phys Ed	4

Customizable SQL for Pivot

Result:

Student	Mathematics	Geography	Phys Ed
Jared	61	94	98
Patrick	99	93	4

One-hot Encoding

This one isn’t necessarily difficult but is time-consuming. Most data scientists don’t consider doing one-hot-encoding in SQL. Although the syntax is simple, they would rather transfer the data out of the data warehouse than the tedious task of writing a 26-line CASE statement. We don’t blame them!

However, we recommend taking advantage of your data warehouse and its processing power. Here is an example using STATE as a column to one-hot-encode.

Before:

Babyname	State	Qty
Alice	AL	156
Alice	AK	146
Alice	PA	654
…	…	…
Zelda	NY	417
Zelda	AL	261
Zelda	CO	321

Customizable SQL for One-Hot Encode

Result:

Babyname	State	State_AL	State_AK	…	State_CO	Qty
Alice	AL	1	0	…	0	156
Alice	AK	0	1	…	0	146
Alice	PA	0	0	…	0	654
…	…			…		…
Zelda	NY	0	0	…	0	417
Zelda	AL	1	0	…	0	261
Zelda	CO	0	0	…	1	321

Market Basket Analysis

When doing a market basket analysis or mining for association rules, the first step is often formatting the data to aggregate each transaction into a single record. This can be challenging for your laptop, but your data warehouse is designed to crunch this data efficiently.

Typical transaction data:

SALESORDERNUMBER	CUSTOMERKEY	ENGLISHPRODUCTNAME	LISTPRICE	WEIGHT	ORDERDATE
SO51247	11249	Mountain-200 Black	2294.99	23.77	1/1/2013
SO51247	11249	Water Bottle - 30 oz.	4.99		1/1/2013
SO51247	11249	Mountain Bottle Cage	9.99		1/1/2013
SO51246	25625	Sport-100 Helmet	34.99		12/31/2012
SO51246	25625	Water Bottle - 30 oz.	4.99		12/31/2012
SO51246	25625	Road Bottle Cage	8.99		12/31/2012
SO51246	25625	Touring-1000 Blue	2384.07	25.42	12/31/2012

Customizable SQL for Market Basket

Result:

NUMTRANSACTIONS	ENGLISHPRODUCTNAME_LISTAGG
207	Mountain Bottle Cage, Water Bottle - 30 oz.
200	Mountain Tire Tube, Patch Kit/8 Patches
142	LL Road Tire, Patch Kit/8 Patches
137	Patch Kit/8 Patches, Road Tire Tube
135	Patch Kit/8 Patches, Touring Tire Tube
132	HL Mountain Tire, Mountain Tire Tube, Patch Kit/8 Patches

Time-Series Aggregations

Time series aggregations are not only used by data scientists but they’re used for analytics as well. What makes them difficult is that window functions require the data to be formatted correctly.

For example, if you want to calculate the average sales amount in the past 14 days, window functions require you to have all sales data broken up into one row per day. Unfortunately, anyone who has worked with sales data before knows that it is usually stored at the transaction level. This is where time-series aggregation comes in handy. You can create aggregated, historical metrics without reformatting the entire dataset. It also comes in handy if we want to add multiple metrics at one time:

Average sales in the past 14 days
Biggest purchase in last 6 months
Count Distinct product types in last 90 days

If you wanted to use window functions, each metric would need to be built independently with several steps.

A better way to handle this, is to use common table expressions (CTEs) to define each of the historical windows, pre-aggregated.

For example:

Transaction ID	Customer ID	Product Type	Purchase Amt	Transaction Date
65432	101	Grocery	101.14	2022-03-01
65493	101	Grocery	98.45	2022-04-30
65494	101	Automotive	239.98	2022-05-01
66789	101	Grocery	86.55	2022-05-22
66981	101	Pharmacy	14	2022-06-15
67145	101	Grocery	93.12	2022-06-22

Customizable SQL for Time Series Aggregate SQL

Result:

Transaction ID	Customer ID	Product Type	Purchase Amt	Transaction Date	Avg Sales Past 14 Days	Max Purchase Past 6 months	Count Distinct Product Type last 90 days
65432	101	Grocery	101.14	2022-03-01	101.14	101.14	1
65493	101	Grocery	98.45	2022-04-30	98.45	101.14	2
65494	101	Automotive	239.98	2022-05-01	169.21	239.98	2
66789	101	Grocery	86.55	2022-05-22	86.55	239.98	2
66981	101	Pharmacy	14	2022-06-15	14	239.98	3
67145	101	Grocery	93.12	2022-06-22	53.56	239.98	3

Conclusion

I hope this piece helps shed some light on the different troubles that a data practitioner will encounter when operating within the modern data stack. SQL is a double-edged sword when it comes to querying the cloud warehouse. While centralizing the compute in the cloud data warehouse increases speed, it sometimes requires some extra SQL skills. I hope that this piece has helped answer questions and provides the syntax and background needed to tackle these problems.

Josh Berry (@Twitter) leads Customer Facing Data Science at Rasgo and has been in the data and analytics profession since 2008. Josh spent 10 years at Comcast where he built the data science team and was a key owner of the internally developed Comcast feature store - one of the first feature stores to hit the market. Following Comcast, Josh was a critical leader in building out Customer Facing Data Science at DataRobot. In his spare time Josh performs complex analysis on interesting topics such as baseball, F1 racing, housing market predictions, and more.

Solving 5 Complex SQL Problems: Tricky Queries Explained

Date Spines

Pivot / Unpivot

One-hot Encoding

Market Basket Analysis

Time-Series Aggregations

Conclusion

More On This Topic

Latest Posts

Top Posts