KDnuggets Home » News » 2017 » Dec » Tutorials, Overviews » SQL Window Functions Tutorial for Business Analysis ( 18:n01 )

SQL Window Functions Tutorial for Business Analysis


In this SQL window functions tutorial, we will describe how these functions work in general, what is behind their syntax, and show how to answer these questions with pure SQL.



Running total

 
Another common request is to calculate running total over some period of time. This is the sum of the current element and all previous elements, for example, this is how the running total monthly revenue would look in our dataset:


Data visualized with Statsbot

And the query to get this is below:

WITH
monthly_revenue as (
    SELECT
    date_trunc(‘month’,datetime)::date as month,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
SELECT *,
sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1


The new thing here is the rows between unbounded preceding and current rowpart that is called “frame clause.” It’s a way to tell you which subset of other rows of your result set you’re interested in, relative to the current row. The general definition of the frame clause is:

rows between frame_start and frame_end


where frame_start can be one of the following:

  • unbounded preceding which is “starting from the first row or ending on the last row of the window”
  • N preceding is N rows you’re interested in
  • current row

and frame_end can be:

  • unbounded following which is “starting from the first row or ending on the last row of the window”
  • N following is N rows you’re interested in
  • current row

It’s very flexible, except that you have to make sure the first part of between is higher than the second part, i.e. between 7 preceding and current row is totally fine, between 7 preceding and 3 preceding is fine too, but between 3 preceding and 7 preceding would throw an error. You can sum, count, and average values within the selected window. You can see a few examples in the query below:

SELECT
 *
,sum(amount) over () as amount_total
,sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum
,sum(amount) over (partition by customer_id order by datetime rows between unbounded preceding and current row) as running_sum_by_customer
,avg(amount) over (order by datetime rows between 5 preceding and current row) as trailing_avg
FROM orders
ORDER BY 1


Every combination makes sense for particular use cases:

  • amount_total is specified without a window and returns the entire total of $3400. Use it when you want to get the total in the same query as individual rows without grouping and joining back the grouped result.
  • running_sum is the running total. Use it when you want to see how some variable such as revenue or website visits is accumulated over a period of time.
  • running_sum_by_customer is the same as above but broken down by segment (you can see how revenue from each individual customer grows, and on bigger data it can be cities or states).
  • trailing_avg shows the average amount of the last 5 orders. Use trailing average when you want to learn the trend and disguise volatility.

The ordering is critical here since the database needs to know how to sum the values. The result can be completely different when different ordering is applied. The picture below shows the result of the query:


Data visualized with Statsbot

The arrow next to running_sum tells us how the total amount is accumulated over time. The colored arrows next to running_sum_by_customer interconnect orders done by the same customer and the values show the total order amount of the given customer at the point of every order. Finally, the grey brackets next to trailing_avg reference the moving window of the last 5 rows.

 

Dealing with duplicate data

 
If you paid attention to the dataset you would probably notice that both orders for customer D on 02–05 have the same order_id=5 which doesn’t look right. How did this happen? It turns out that the original order was $250, then the customer decided to spend $50 more. The new record was inserted but the old record was not deleted. Such things happen in one or another table.

A long term solution in this case is to rectify the existing data flows and increase data awareness among developers, and what you can do right away is roll your sleeves up and clean the mess on your side:

SELECT *
FROM (
    SELECT *,
    row_number() over (partition by order_id order by datetime desc)
    FROM orders
)
WHERE row_number=1


That would get you all order records except one that you’d like to filter out. Let’s review the function that allows us to do so: row_number() returns the incremental counter of the row within the given window: 1,2,3, etc. It doesn’t take any parameters, that’s why it ends with empty brackets. The given window here is a set of rows that share a common order_id (partition by … is what separates these sets from each other), sorted by datetime descending, so the intermediary result of the subquery in the middle looks like this:


Data visualized with Statsbot

Every partition here is represented by a single row, except order_id=5. Inside the partition, you sorted the rows by datetime descending, so the latest row gets 1 and the earlier row gets 2. Then you filter only rows that have 1 to get rid of duplicates.

This is very useful for all sorts of duplicate problems. You might see that duplicates inflate the revenue number, so to calculate correct metrics we have to clean them out by combining the duplicate filtering query with the revenue growth query like this:

WITH
orders_cleaned as (
    SELECT *
    FROM (
        SELECT *,
        row_number() over (partition by order_id order by datetime desc)
        FROM orders
    )
    WHERE row_number=1
)
,monthly_revenue as (
    SELECT
    date_trunc(‘month’,datetime)::date as month,
    sum(amount) as revenue
    FROM orders_cleaned
    GROUP BY 1
)
,prev_month_revenue as (
    SELECT *,
    lag(revenue) over (order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT *,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1


Now the data is clean and the revenue and growth metrics are correct. To avoid using orders_cleaned step in every query, you can create a view and use it as a table reference in other queries:

CREATE VIEW orders_cleaned AS
SELECT order_id, customer_id, state, “datetime”, amount
FROM (
    SELECT *,
    row_number() over (partition by order_id order by datetime desc)
    FROM orders
)
WHERE row_number=1


Besides filtering duplicates, window partitions are very useful when you need to identify top N rows in every group.

 

Top N rows in every group

 
Finding top rows in every group is a typical task for a data analyst. Finding out who your best customers are and reaching out to them is a good way to know what people especially like about your company and make it a standard. It is equally useful for employees, as leaderboards can be very good motivation for any team in the company. Let’s see how it works. Considering our dataset is tiny, let’s get the top 2 orders for every month:

WITH
orders_ranked as (
    SELECT
    date_trunc(‘month’,datetime)::date as month,
    *,
    row_number() over (partition by date_trunc(‘month’,datetime) order by amount desc, datetime)
    FROM orders_cleaned
)
SELECT *
FROM orders_ranked
WHERE row_number<=2
ORDER BY 1


The intermediary result of the orders_ranked statement would look like this (rows that appear in the final result are highlighted):


Data visualized with Statsbot

You can use any expression to separate partitions in window specification, not only column name (here we separated them by month, every partition is highlighted by its own color).

There are orders with the same month and amount, like order_id=1 and order_id=3, so we decided to resolve this by picking up the earliest order, adding datetime column to the sorting. If you’re interested in pulling both rows in case of conflict you can use rank function instead of row_number.

 

Repeat purchase behavior

 
Repeat purchase behavior is the key for a successful business, and investors totally love companies that can retain customers and make them spend more and more. We can translate this to a more precise data question: what is the repeat purchase rate and the typical difference between the first order and the second order amount? That would be expressed as:

WITH
customer_orders as (
    SELECT *,
    row_number() over (partition by customer_id order by datetime) as customer_order_n,
    lag(amount) over (partition by customer_id order by datetime) as prev_order_amount
    FROM orders_cleaned
)
SELECT
round(100.0*sum(case when customer_order_n=2 then 1 end)/count(distinct customer_id),1) as repeat_purchases,
avg(case when customer_order_n=2 then 1.0*amount/prev_order_amount end) as revenue_expansion
FROM customer_orders


The result of the intermediary customer_orders statement would look like this:


Data visualized with Statsbot

Every customer’s partition is highlighted by its own color. The brackets next to customer_id outline our partitions, the arrows next to datetime show sorting direction, the arrows that point to amount values show where prev_order_amount values come from, and customer_order_n values for repeat purchases are underlined. Since customers D, E, and F have only one order they are out of the analysis.

The final result looks like this:


Data visualized with Statsbot

This time we partitioned by customer_id to isolate sets of order rows that belong to the same customer, and identified the row order and the previous row’s amount value. In the final query, we have used conditional aggregates to calculate desired metrics which tell us that half of customers buy again and they spend almost twice as much on the second order. Imagine it’s not a dummy dataset, what a great business would that be!

 

Wrap up

 
As you have seen in our tutorial, SQL window functions are a powerful concept that allows advanced calculations. A typical window function consists of expression and window specification with optional partitioning and frame clause, and the opportunities to slice, smooth, interconnect, and deduplicate your data are truly boundless. There are plenty of interesting use cases where these functions give a great value. We’ll continue talking about them in the next few tutorials. Stay tuned!

 
Bio: Alex Yeskov is a data/geospatial enthusiast, EM alumni, and analytics lead at ezhome.

Original. Reposted with permission.

Related:


Sign Up