The 5 Hardest Things to Do in SQL
The 5 hardest things Josh Berry, a 15-year analytics professional, ran into while switching from Python to SQL, with examples, SQL code, and a resource for customizing the SQL to your own project.
Many of us have experienced the speed and efficiency that come from centralizing compute within the cloud data warehouse. Many of us have also realized that, as with anything, this value comes with its own set of downsides.
One of the primary drawbacks of this approach is that you must learn to write and run queries in a different language, specifically SQL. While writing SQL is faster and less expensive than standing up a secondary infrastructure to run Python (on your laptop or on in-office servers), it comes with its own complexities, depending on what information the data analyst wants to extract from the cloud warehouse. The switch to cloud data warehouses makes complex SQL more useful relative to Python. Having been through this experience myself, I decided to record the specific transformations that are the most painful to learn and perform in SQL, and to provide the actual SQL needed to alleviate some of this pain for my readers.
To aid in your workflow, I provide examples of the data structure before and after each transform is executed, so you can follow along and validate your work. I have also provided the actual SQL needed to perform each of the 5 hardest transformations. As your data changes from project to project, you’ll need new SQL to perform the same transformation. We’ve provided links to dynamic SQL for each transformation so you can continue to generate the SQL needed for your analysis on an as-needed basis!
Date Spines
It is not clear where the term "date spine" originated, but even those who don’t know the term are probably familiar with the concept.
Imagine you are analyzing your daily sales data, and it looks like this:
No sales happened on the 16th and 17th, so the rows are completely missing. If we were trying to calculate average daily sales, or build a time series forecast model, this format would be a major problem. What we need to do is insert rows for the missing days.
Here is the basic concept:
1. Generate or select unique dates
2. Generate or select unique products
3. Cross join (Cartesian product) all combinations of 1 and 2
4. Outer join the result of step 3 to your original data
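The four steps above can be sketched as a single query. This is a minimal illustration, not the article's original SQL: the table `daily_sales`, its columns, and the sample rows are assumptions, and the SQL is wrapped in Python's built-in `sqlite3` module only so that it runs end to end. The recursive CTE that generates dates is SQLite syntax; most warehouses have their own date generator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical daily sales with no rows for the 16th and 17th.
conn.executescript("""
CREATE TABLE daily_sales (sale_date TEXT, product TEXT, sales REAL);
INSERT INTO daily_sales VALUES
  ('2022-01-14', 'A', 10.0),
  ('2022-01-15', 'A', 20.0),
  ('2022-01-18', 'A', 30.0),
  ('2022-01-14', 'B', 5.0),
  ('2022-01-18', 'B', 15.0);
""")

DATE_SPINE_SQL = """
WITH RECURSIVE
bounds AS (
  SELECT MIN(sale_date) AS lo, MAX(sale_date) AS hi FROM daily_sales
),
dates(d) AS (                      -- step 1: every calendar day in the range
  SELECT lo FROM bounds
  UNION ALL
  SELECT date(d, '+1 day') FROM dates CROSS JOIN bounds WHERE d < hi
),
products AS (                      -- step 2: unique products
  SELECT DISTINCT product FROM daily_sales
),
spine AS (                         -- step 3: all date/product combinations
  SELECT d AS sale_date, product FROM dates CROSS JOIN products
)
SELECT s.sale_date, s.product,
       COALESCE(f.sales, 0) AS sales   -- step 4: missing days become 0
FROM spine AS s
LEFT JOIN daily_sales AS f
  ON f.sale_date = s.sale_date AND f.product = s.product
ORDER BY s.product, s.sale_date
"""

spine_rows = conn.execute(DATE_SPINE_SQL).fetchall()
for row in spine_rows:
    print(row)
```

Filling the gaps with 0 is a choice made for this sketch; leaving them as NULL may be more appropriate when a missing day means "unknown" rather than "no sales".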
The end result will look like this:
Pivot / Unpivot
Sometimes, when doing an analysis, you want to restructure the table. For instance, we might have a list of students, subjects, and grades, but we want to break each subject out into its own column. We all know and love Excel because of its pivot tables. But have you ever tried to do it in SQL? Not only does every database have annoying differences in how PIVOT is supported, but the syntax is unintuitive and easy to forget.
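One way to sidestep the dialect differences is conditional aggregation, which uses only `CASE` and `GROUP BY`. The sketch below applies it to the students/subjects/grades example and then reverses it with `UNION ALL`. The table names, column names, and rows are assumptions made for illustration, and Python's built-in `sqlite3` module is used only to make the SQL runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical long-format grades: one row per student and subject.
conn.executescript("""
CREATE TABLE grades (student TEXT, subject TEXT, grade INTEGER);
INSERT INTO grades VALUES
  ('Ann', 'Math', 90), ('Ann', 'Science', 85),
  ('Bob', 'Math', 70), ('Bob', 'Science', 95);
""")

# Pivot: one output column per subject, one row per student.
PIVOT_SQL = """
SELECT student,
       MAX(CASE WHEN subject = 'Math'    THEN grade END) AS math,
       MAX(CASE WHEN subject = 'Science' THEN grade END) AS science
FROM grades
GROUP BY student
ORDER BY student
"""
pivoted = conn.execute(PIVOT_SQL).fetchall()

# Store the wide result so the reverse direction can be shown.
conn.execute("CREATE TABLE grades_wide AS " + PIVOT_SQL)

# Unpivot: stack one SELECT per column back into long format.
UNPIVOT_SQL = """
SELECT student, 'Math' AS subject, math AS grade FROM grades_wide
UNION ALL
SELECT student, 'Science' AS subject, science AS grade FROM grades_wide
ORDER BY student, subject
"""
unpivoted = conn.execute(UNPIVOT_SQL).fetchall()

print(pivoted)
print(unpivoted)
```

The trade-off is that every subject has to be listed by hand, which is exactly the part a native PIVOT or generated SQL would take over.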
One-Hot Encoding
This one isn’t necessarily difficult, but it is time-consuming. Most data scientists don’t consider doing one-hot encoding in SQL. Although the syntax is simple, they would rather transfer the data out of the data warehouse than take on the tedious task of writing a 26-line CASE statement. We don’t blame them!
However, we recommend taking advantage of your data warehouse and its processing power. Here is an example using STATE as a column to one-hot-encode.
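A minimal sketch of that pattern follows. To avoid typing one `CASE` line per state, it reads the distinct values first and generates the statement. The `customers` table, its rows, and the `state_xx` output column names are assumptions, and Python's built-in `sqlite3` module stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical customer table with a STATE column to encode.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, state TEXT);
INSERT INTO customers VALUES (1, 'CA'), (2, 'NY'), (3, 'TX'), (4, 'CA');
""")

# Read the distinct categories so the CASE list is generated, not hand-written.
states = [row[0] for row in conn.execute(
    "SELECT DISTINCT state FROM customers ORDER BY state")]

# The values are embedded in the SQL text, so accept only plain letters.
if not all(s.isalpha() for s in states):
    raise ValueError("unexpected state value; refusing to build SQL")

case_lines = ",\n".join(
    f"  CASE WHEN state = '{s}' THEN 1 ELSE 0 END AS state_{s.lower()}"
    for s in states
)
ONE_HOT_SQL = (
    f"SELECT customer_id,\n{case_lines}\n"
    "FROM customers ORDER BY customer_id"
)

print(ONE_HOT_SQL)
encoded = conn.execute(ONE_HOT_SQL).fetchall()
print(encoded)
```

The generated text is ordinary SQL, so it can be pasted into the warehouse and run there once the list of states is stable.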
Market Basket Analysis
When doing a market basket analysis or mining for association rules, the first step is often formatting the data to aggregate each transaction into a single record. This can be challenging for your laptop, but your data warehouse is designed to crunch this data efficiently.
Typical transaction data:
| SO51247 | 11249 | Water Bottle - 30 oz. | 4.99 | 1/1/2013 |
| SO51247 | 11249 | Mountain Bottle Cage | 9.99 | 1/1/2013 |
| SO51246 | 25625 | Water Bottle - 30 oz. | 4.99 | 12/31/2012 |
| SO51246 | 25625 | Road Bottle Cage | 8.99 | 12/31/2012 |
Result:
| 207 | Mountain Bottle Cage, Water Bottle - 30 oz. |
| 200 | Mountain Tire Tube, Patch Kit/8 Patches |
| 142 | LL Road Tire, Patch Kit/8 Patches |
| 137 | Patch Kit/8 Patches, Road Tire Tube |
| 135 | Patch Kit/8 Patches, Touring Tire Tube |
| 132 | HL Mountain Tire, Mountain Tire Tube, Patch Kit/8 Patches |
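The aggregation step can be sketched with the four sample rows shown above. The column names (`order_number`, `customer_key`, `product`, `price`, `order_date`) are assumptions, since the original table headers are not shown, and the dates are rewritten in ISO format. Python's built-in `sqlite3` module is used only to run the SQL; the windowed `group_concat` needs SQLite 3.25 or later and is used here so the products are always concatenated in alphabetical order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The four sample rows from the article, under assumed column names.
conn.executescript("""
CREATE TABLE order_lines (
  order_number TEXT, customer_key INTEGER, product TEXT,
  price REAL, order_date TEXT
);
INSERT INTO order_lines VALUES
  ('SO51247', 11249, 'Water Bottle - 30 oz.', 4.99, '2013-01-01'),
  ('SO51247', 11249, 'Mountain Bottle Cage',  9.99, '2013-01-01'),
  ('SO51246', 25625, 'Water Bottle - 30 oz.', 4.99, '2012-12-31'),
  ('SO51246', 25625, 'Road Bottle Cage',      8.99, '2012-12-31');
""")

BASKET_SQL = """
WITH items AS (          -- one row per order and product
  SELECT DISTINCT order_number, product FROM order_lines
),
baskets AS (             -- one row per order, products joined alphabetically
  SELECT DISTINCT order_number,
         group_concat(product, ', ') OVER (
           PARTITION BY order_number
           ORDER BY product
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         ) AS basket
  FROM items
)
SELECT COUNT(*) AS num_transactions, basket
FROM baskets
GROUP BY basket
ORDER BY num_transactions DESC, basket
"""

basket_counts = conn.execute(BASKET_SQL).fetchall()
for row in basket_counts:
    print(row)
```

Sorting the products before concatenating matters: without it, the same pair of items could appear as two different basket strings. In a warehouse, an ordered string aggregate such as `LISTAGG ... WITHIN GROUP (ORDER BY ...)` or `STRING_AGG(... ORDER BY ...)` usually plays the role of the windowed `group_concat`.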
Time-Series Aggregations
Time-series aggregations are used not only by data scientists but for analytics as well. What makes them difficult is that window functions require the data to be formatted correctly.
For example, if you want to calculate the average sales amount in the past 14 days, window functions require you to have all sales data broken up into one row per day. Unfortunately, anyone who has worked with sales data before knows that it is usually stored at the transaction level. This is where time-series aggregation comes in handy. You can create aggregated, historical metrics without reformatting the entire dataset. It also comes in handy if we want to add multiple metrics at one time:
- Average sales in the past 14 days
- Biggest purchase in last 6 months
- Count Distinct product types in last 90 days
If you wanted to use window functions, each metric would need to be built independently with several steps.
A better way to handle this is to use common table expressions (CTEs) to define each of the historical windows, pre-aggregated.
Before:
| Transaction ID | Customer ID | Product Type | Purchase Amt | Transaction Date |
After:
| Transaction ID | Customer ID | Product Type | Purchase Amt | Transaction Date | Avg Sales Past 14 Days | Max Purchase Past 6 months | Count Distinct Product Type last 90 days |
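The CTE approach can be sketched as one CTE per historical window, each joining the transaction table to itself on a date range and aggregating. This is an illustration under stated assumptions rather than the article's original SQL: the `transactions` table and its rows are invented, each window is computed per customer and includes the current transaction's date, and the date arithmetic uses SQLite's `date()` function through Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical transaction-level data, matching the "before" columns.
conn.executescript("""
CREATE TABLE transactions (
  transaction_id INTEGER, customer_id INTEGER, product_type TEXT,
  purchase_amt REAL, transaction_date TEXT
);
INSERT INTO transactions VALUES
  (1, 1, 'Bike',   100.0, '2022-01-01'),
  (2, 1, 'Helmet',  50.0, '2022-01-10'),
  (3, 1, 'Bike',    30.0, '2022-03-01'),
  (4, 1, 'Tire',    20.0, '2022-08-15'),
  (5, 2, 'Bike',   200.0, '2022-01-05');
""")

TIME_SERIES_SQL = """
WITH avg_14d AS (      -- average purchase in the 14 days up to each transaction
  SELECT t.transaction_id, AVG(h.purchase_amt) AS avg_sales_14d
  FROM transactions AS t
  JOIN transactions AS h
    ON h.customer_id = t.customer_id
   AND h.transaction_date >  date(t.transaction_date, '-14 days')
   AND h.transaction_date <= t.transaction_date
  GROUP BY t.transaction_id
),
max_6m AS (            -- biggest purchase in the 6 months up to each transaction
  SELECT t.transaction_id, MAX(h.purchase_amt) AS max_purchase_6m
  FROM transactions AS t
  JOIN transactions AS h
    ON h.customer_id = t.customer_id
   AND h.transaction_date >  date(t.transaction_date, '-6 months')
   AND h.transaction_date <= t.transaction_date
  GROUP BY t.transaction_id
),
types_90d AS (         -- distinct product types in the 90 days up to each transaction
  SELECT t.transaction_id, COUNT(DISTINCT h.product_type) AS product_types_90d
  FROM transactions AS t
  JOIN transactions AS h
    ON h.customer_id = t.customer_id
   AND h.transaction_date >  date(t.transaction_date, '-90 days')
   AND h.transaction_date <= t.transaction_date
  GROUP BY t.transaction_id
)
SELECT t.transaction_id, a.avg_sales_14d, m.max_purchase_6m, c.product_types_90d
FROM transactions AS t
LEFT JOIN avg_14d   AS a ON a.transaction_id = t.transaction_id
LEFT JOIN max_6m    AS m ON m.transaction_id = t.transaction_id
LEFT JOIN types_90d AS c ON c.transaction_id = t.transaction_id
ORDER BY t.transaction_id
"""

metrics = conn.execute(TIME_SERIES_SQL).fetchall()
for row in metrics:
    print(row)
```

Each window is independent, so adding another metric means adding one more CTE and one more join, without reshaping the transaction table into one row per day.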
I hope this piece sheds some light on the troubles a data practitioner will encounter when operating within the modern data stack. SQL is a double-edged sword when it comes to querying the cloud warehouse: centralizing compute in the cloud data warehouse increases speed, but it sometimes requires extra SQL skills. I hope this piece has answered some questions and provided the syntax and background needed to tackle these problems.
Josh Berry leads Customer Facing Data Science at Rasgo and has been in the data and analytics profession since 2008. Josh spent 10 years at Comcast where he built the data science team and was a key owner of the internally developed Comcast feature store - one of the first feature stores to hit the market. Following Comcast, Josh was a critical leader in building out Customer Facing Data Science at DataRobot. In his spare time Josh performs complex analysis on interesting topics such as baseball, F1 racing, housing market predictions, and more.