Top Five SQL Window Functions You Should Know For Data Science Interviews
Focusing on the important concepts for data scientists.
SQL is the universal language in the data world and is the most important skill to nail down as a data professional.
The reason SQL is so important is that it is the main skill that is required during the data wrangling phase. A lot of data exploration, data manipulation, pipeline development, and dashboard creation is done through SQL.
What separates great data scientists from good data scientists is that great data scientists can wrangle data as much as the capabilities of SQL allow. A big part of being able to fully use all that SQL has to offer is by knowing how to use window functions.
With that said, let’s dive into it!
1. Deltas with LEAD() and LAG()
LEAD() and LAG() are mostly used when comparing one period of time with the previous period of time for a given metric. To give a few examples...
- You can get the delta between each year’s sales and the previous year’s sales
- You can get the delta in the number of sign-ups/conversions/website visits on a month to month basis
- You can compare user churn on a monthly basis
The following query shows how you can query the monthly percent change in costs
with monthly_costs as ( SELECT date , monthlycosts , LEAD(monthlycosts) OVER (ORDER BY date) as previousCosts FROM costs )SELECT date , (monthlycosts - previousCosts) / previousCosts * 100 AS costPercentChange FROM monthly_costs
2. Cumulative Sums with SUM() or COUNT()
Calculating running totals can be simply done through a windows function that starts with SUM() or COUNT(). This is a powerful tool when you want to show the growth of a particular metric over time. More specifically, it’s useful in the following circumstances:
- Get a running total of revenue and costs over time
- Get a running total of time-spent-on-app per user
- Get a running total of conversions over time
The following example shows how you can include a cumulative sum column of monthly costs:
SELECT date , monthlycosts , SUM(monthlycosts) OVER (ORDER BY date) as cumCosts FROM cost_table
3. Moving Averages with AVG()
AVG() is really powerful in windows functions as it allows you to compute moving averages over time.
Moving averages are a simple, yet effective, way to forecast values in the short term. They’re also extremely useful at smoothing out volatile curves on a graph. Generally, moving averages are used to gauge the general direction of where things are moving.
- They can be used to get the general trend of weekly sales (is the average going up over time?). This would indicate growth as a company.
- They can likewise be used to get the general trend of weekly conversions or website visits.
The following query is an example of getting the 10-day moving average for conversions.
SELECT Date , dailyConversions , AVG(dailyConversions) OVER (ORDER BY Date ROWS 10 PRECEDING) AS 10_dayMovingAverage FROM conversions
ROW_NUMBER() is particularly useful when you want to get the first or last record. For example, if you have a table of when gym members came to the gym and you want to get the date of the first day that they came to the gym, you can PARTITION BY customer (name/id) and ORDER BY purchase date. Then, in order to get the first row, you can simply filter for the rows with rowNumber equal to one.
This example shows how you can use ROW_NUMBER() to get the first date of when each member (user) visited.
with numbered_visits as ( SELECT memberId , visitDate , ROW_NUMBER() OVER (PARTITION BY customerId ORDER BY purchaseDate) as rowNumber FROM gym_visits )SELECT * FROM numbered_visits WHERE rowNumber = 1
To recap, if you ever need to get the first or last record, ROW_NUMBER() is a great way to achieve that.
5. Record Ranking with DENSE_RANK()
DENSE_RANK() is similar to ROW_NUMBER() except that it returns the same rank for equal values. Dense ranking is quite useful when it comes to retrieving the top records, for example:
- If you want to pull the top 10 most-watched Netflix shows this week
- If you want to get the top 100 users based on dollars spent
- If you want to see the behavior of the 1000 least active users
If you wanted to rank your top customers by total sales, DENSE_RANK() would be an appropriate function to use.
SELECT customerId , totalSales , DENSE_RANK() OVER (ORDER BY totalSales DESC) as rank FROM customers
Thanks for Reading!
And that’s all! I hope that this helps you in your interview prep — I’m sure that if you know these 5 concepts inside-out, you’ll do great when it comes to most SQL window function problems out there.
As always, I wish you the best in your learning endeavors!
Terence Shin is a data enthusiast with 3+ years of experience in SQL and 2+ years of experience in Python, and a blogger on Towards Data Science and KDnuggets.
Original. Reposted with permission.