Analytics Patterns Every Data Scientist Should Master

Learn the analytics patterns you can use in most business analytics tasks.



Analytics Patterns in Data Science
Image by Author | Canva

 

Introduction

 
Data analysis problems aren’t really unique. However, “though your problems are non-unique, that doesn’t make them go away,” to paraphrase Neil Young. What will make them go away? Realizing that, beneath the surface, most of them rely on a handful of reusable patterns.

I’ll show you those patterns so you can reuse them in your work or in job interviews, no matter the data or industry. Data is always just that — data. All the patterns will be written in PostgreSQL, based on coding interview questions from StrataScratch, and I’ll tie each one to real business situations.

 
Analytics Patterns in Data Science
 

1. Joins + Filters: Finding the Right Subset

 
Question: Movie Duration Match from Amazon

Task: You are developing a feature that suggests individual movies from Amazon's content database that fit within a given flight's duration.

For flight 101, find movies whose runtime is less than or equal to the flight's duration. The output should list suggested movies for the flight, including flight_id, movie_id, and movie_duration.

Solution:

SELECT fs.flight_id,
       ec.movie_id,
       ec.duration AS movie_duration
FROM flight_schedule fs
JOIN entertainment_catalog ec ON ec.duration <= fs.flight_duration
WHERE fs.flight_id = 101
ORDER BY ec.duration;

 


Pattern: The join + filter is the pattern where you join two datasets and filter rows (in WHERE before aggregation or in HAVING after aggregation) based on conditions.

This is the sequence.

 

  1. Identify the primary table: The dataset that defines what you’re analyzing (flight_schedule)
  2. Join supplementary data: The dataset(s) that add context or attributes (entertainment_catalog)
  3. Apply filters: Remove the rows you don’t need (WHERE fs.flight_id = 101)
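
The three steps above can be run end-to-end on a few rows of toy data. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL; the table and column names follow the question, but the rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flight_schedule (flight_id INT, flight_duration INT);
    CREATE TABLE entertainment_catalog (movie_id INT, duration INT);
    INSERT INTO flight_schedule VALUES (101, 120), (102, 300);
    INSERT INTO entertainment_catalog VALUES (1, 90), (2, 110), (3, 150);
""")

# Primary table (flight_schedule) -> join supplementary data on an
# inequality -> filter down to the one flight we care about.
rows = conn.execute("""
    SELECT fs.flight_id, ec.movie_id, ec.duration AS movie_duration
    FROM flight_schedule fs
    JOIN entertainment_catalog ec ON ec.duration <= fs.flight_duration
    WHERE fs.flight_id = 101
    ORDER BY ec.duration
""").fetchall()

print(rows)  # [(101, 1, 90), (101, 2, 110)] -- the 150-minute movie is too long
```

Note the join condition here is an inequality rather than a key match: every movie short enough for the flight qualifies.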

 

// Business Uses

  • HR: Joining employees with their working hours to identify overtime
  • Retail: Joining orders with product details to analyze product category performance
  • Streaming: Joining users with their sessions to find active viewers

 

2. Window Functions: Ranking & Ordering

 
Question: Top Posts Per Channel from Meta

Task: Identify the top 3 posts with the highest like counts for each channel. Assign a rank to each post based on its like count, allowing for gaps in ranking when posts have the same number of likes.

The output should display the channel name, post ID, post creation date, and the like count for each post.

Solution:

WITH ranked_posts AS
  (SELECT post_id,
          channel_id,
          created_at,
          likes,
          RANK() OVER (PARTITION BY channel_id ORDER BY likes DESC) AS post_rank
   FROM posts
   WHERE likes > 0)
   
SELECT c.channel_name,
       r.post_id,
       r.created_at,
       r.likes
FROM ranked_posts AS r
JOIN channels AS c ON r.channel_id = c.channel_id
WHERE r.post_rank <= 3;

 


Pattern: These are the window functions used for ranking.

  • RANK(): Ranking with gaps
  • DENSE_RANK(): Ranking without gaps
  • ROW_NUMBER(): Unique ordering with no ties

When ranking, follow this pattern.

 

  1. Partition the data: Define the logical group you’re analyzing (PARTITION BY channel_id)
  2. Order within each partition: Specify the ranking or time sequence (ORDER BY likes DESC)
  3. Apply the ranking window function: Use RANK(), DENSE_RANK(), or ROW_NUMBER(), depending on the task
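
Here is a minimal runnable sketch of the partition-order-rank sequence, using Python's sqlite3 (SQLite 3.25+ supports window functions) with invented rows; a tie at 50 likes shows the gap that RANK() leaves:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (post_id INT, channel_id INT, likes INT);
    INSERT INTO posts VALUES
        (1, 10, 50), (2, 10, 50), (3, 10, 40),
        (4, 10, 30), (5, 20, 99);
""")

# Partition by channel, order by likes descending, rank with gaps:
# the two posts tied at 50 likes both get rank 1, the next gets rank 3.
ranked = conn.execute("""
    SELECT channel_id, post_id, likes,
           RANK() OVER (PARTITION BY channel_id ORDER BY likes DESC) AS post_rank
    FROM posts
    ORDER BY channel_id, post_rank, post_id
""").fetchall()

print([r[3] for r in ranked if r[0] == 10])  # [1, 1, 3, 4]
```

Swapping in DENSE_RANK() would yield [1, 1, 2, 3], and ROW_NUMBER() would break the tie arbitrarily as [1, 2, 3, 4].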

 

// Business Uses

This pattern is used to identify top performers, for example:

  • Sales: Top sales representatives per region
  • Education: Ranking students by test scores within each class
  • Logistics: Ranking delivery drivers by completed deliveries within each region

 

3. Aggregation + Grouping: The Roll-Up Pattern

 
Question: Same-Day Orders from Walmart

Task: Find users who started a session and placed an order on the same day. Calculate the total number of orders placed on that day and the total order value for that day.

Solution:

SELECT s.user_id,
       s.session_date,
       COUNT(o.order_id) AS total_orders,
       SUM(o.order_value) AS total_order_value
FROM
  (SELECT DISTINCT user_id,
                   session_date
   FROM sessions) s
JOIN order_summary o ON s.user_id = o.user_id
AND s.session_date = o.order_date
GROUP BY s.user_id, s.session_date;

 


Pattern: This pattern is for summarizing data, e.g. across users, dates, products, or other analytical dimensions.

Here’s the sequence.

 

  1. Identify the grouping dimension: The column you want to group by (user_id and session_date)
  2. Group the data: Use GROUP BY on the chosen dimension(s) to group the data
  3. Aggregate the metrics: Summarize the values for each group using the aggregate functions
  4. Filter aggregated results (optional): Use HAVING to keep only certain groups depending on the aggregate value
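
The roll-up sequence can be sketched with Python's sqlite3 and made-up rows (user 2's order falls on a different day than their session, so it drops out of the join):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (user_id INT, session_date TEXT);
    CREATE TABLE order_summary (order_id INT, user_id INT,
                                order_date TEXT, order_value REAL);
    INSERT INTO sessions VALUES
        (1, '2024-01-01'), (1, '2024-01-01'), (2, '2024-01-02');
    INSERT INTO order_summary VALUES
        (100, 1, '2024-01-01', 25.0),
        (101, 1, '2024-01-01', 10.0),
        (102, 2, '2024-01-03', 99.0);
""")

# Group by the (user, day) dimension and roll up the order metrics.
# DISTINCT in the subquery prevents double-counting orders for users
# with several sessions on the same day.
rows = conn.execute("""
    SELECT s.user_id, s.session_date,
           COUNT(o.order_id)  AS total_orders,
           SUM(o.order_value) AS total_order_value
    FROM (SELECT DISTINCT user_id, session_date FROM sessions) s
    JOIN order_summary o
      ON s.user_id = o.user_id AND s.session_date = o.order_date
    GROUP BY s.user_id, s.session_date
""").fetchall()

print(rows)  # [(1, '2024-01-01', 2, 35.0)]
```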

 

// Business Uses

  • E-commerce: Orders and revenue per customer per day
  • SaaS: Logins per user per week
  • Finance: Transactions per account per quarter

 

4. Pivoting: Turning Rows Into Columns

 
Question: Highest Payment from the City of San Francisco

Task: Create a pivot table that shows the highest payment for each employee in each year, ordered by employee name in ascending order. The table should show years 2011, 2012, 2013, and 2014.

Solution:

SELECT employeename,
       MAX(pay_2011) AS pay_2011,
       MAX(pay_2012) AS pay_2012,
       MAX(pay_2013) AS pay_2013,
       MAX(pay_2014) AS pay_2014
FROM
    (SELECT employeename,
            CASE 
                WHEN year = 2011
                THEN totalpay
                ELSE 0
            END AS pay_2011,
            CASE 
                WHEN year = 2012
                THEN totalpay
                ELSE 0
            END AS pay_2012,
            CASE 
                WHEN year = 2013
                THEN totalpay
                ELSE 0
            END AS pay_2013,
            CASE 
                WHEN year = 2014
                THEN totalpay
                ELSE 0
            END AS pay_2014
    FROM sf_public_salaries) pmt
GROUP BY employeename
ORDER BY employeename;

 


Pattern: Pivoting turns row values into columns. This is useful when comparing metrics across years, categories, or segments.

Here’s the sequence.

 

  1. Identify the key column: The column you want to show as rows (employeename)
  2. Choose the pivot column: The field whose unique values will become new columns (year)
  3. Define the metric: Determine the metric you want to calculate and aggregate (totalpay)
  4. Create conditional columns: Use CASE WHEN (or PIVOT, where supported) to assign values to each column based on the pivot column
  5. Aggregate conditional expressions in the outer query: Aggregate each pivot column
  6. Group the data: Use GROUP BY on the key column to group the output
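
The same pivot can be sketched compactly in Python's sqlite3 with invented salary rows; here the CASE expressions are folded directly inside MAX(), which is equivalent to the two-level subquery version above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sf_public_salaries (employeename TEXT, year INT, totalpay REAL);
    INSERT INTO sf_public_salaries VALUES
        ('Ann', 2011, 100.0), ('Ann', 2011, 120.0),
        ('Ann', 2012, 130.0), ('Bob', 2013, 90.0);
""")

# One output column per pivot value (year), aggregated per employee.
rows = conn.execute("""
    SELECT employeename,
           MAX(CASE WHEN year = 2011 THEN totalpay ELSE 0 END) AS pay_2011,
           MAX(CASE WHEN year = 2012 THEN totalpay ELSE 0 END) AS pay_2012,
           MAX(CASE WHEN year = 2013 THEN totalpay ELSE 0 END) AS pay_2013,
           MAX(CASE WHEN year = 2014 THEN totalpay ELSE 0 END) AS pay_2014
    FROM sf_public_salaries
    GROUP BY employeename
    ORDER BY employeename
""").fetchall()

print(rows)  # [('Ann', 120.0, 130.0, 0, 0), ('Bob', 0, 0, 90.0, 0)]
```

The ELSE 0 branch makes missing years show as 0; using ELSE NULL instead would show them as NULL.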

 

// Business Uses

  • Finance: Comparing revenue per quarter side-by-side
  • HR: Comparing salaries across years
  • Retail: Comparing monthly sales totals

 

5. Cumulative Metrics: Growth, Retention, and Progress

 
Question: Revenue Over Time from Amazon

Task: Calculate the 3-month rolling average of total revenue from purchases. The returns — represented by negative purchase values — shouldn’t be included in the calculation.

The output should show year-month (YYYY-MM) and the rolling average, sorted from the earliest to the latest month.

Solution:

SELECT t.month,
       AVG(t.monthly_revenue) OVER (ORDER BY t.month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_revenue
FROM
  (SELECT to_char(created_at::date, 'YYYY-MM') AS month,
          SUM(purchase_amt) AS monthly_revenue
   FROM amazon_purchases
   WHERE purchase_amt > 0
   GROUP BY to_char(created_at::date, 'YYYY-MM')
   ORDER BY to_char(created_at::date, 'YYYY-MM')
) t
ORDER BY t.month ASC;

 


Pattern: Cumulative metrics (e.g. running total, moving average, or running count) are used to understand trends rather than showing individual time periods separately.

Here’s the sequence.

 

  1. Pre-aggregate by time period (optional): Summarize the data into totals per required time period (shown in the subquery)
  2. Apply the aggregate function: In the main query, use an aggregate function on the column you want to accumulate
  3. Turn the aggregate function into a window function: Add the OVER() clause
  4. Order the time periods: Sort the data within a partition chronologically so the cumulative calculation is applied correctly (ORDER BY t.month)
  5. Define the window frame: Define the number of previous or following periods to include in the cumulative calculation (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
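
These steps can be run with Python's sqlite3 (SQLite 3.25+ for window frames) and invented purchases; strftime() replaces PostgreSQL's to_char() for the YYYY-MM bucket:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE amazon_purchases (created_at TEXT, purchase_amt REAL);
    INSERT INTO amazon_purchases VALUES
        ('2020-01-05', 100), ('2020-01-20', 50),   -- Jan total: 150
        ('2020-02-10', 200),                       -- Feb total: 200
        ('2020-03-01', 100), ('2020-03-15', -30);  -- Mar: return excluded
""")

# Pre-aggregate to monthly totals, then average over a 3-row frame
# (the current month and up to two preceding months).
rows = conn.execute("""
    SELECT t.month,
           AVG(t.monthly_revenue) OVER (
               ORDER BY t.month
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS avg_revenue
    FROM (SELECT strftime('%Y-%m', created_at) AS month,
                 SUM(purchase_amt) AS monthly_revenue
          FROM amazon_purchases
          WHERE purchase_amt > 0
          GROUP BY 1) t
    ORDER BY t.month
""").fetchall()

print(rows)  # [('2020-01', 150.0), ('2020-02', 175.0), ('2020-03', 150.0)]
```

For a running total instead of a moving average, change the function to SUM() and the frame to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.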

 

// Business Uses

  • E-commerce: Running total revenue
  • SaaS: Cumulative active users
  • Product analytics: Cohort retention curves
  • Finance: Trailing averages
  • Operations: Rolling total of support tickets

 

6. Funnel Analysis: Tracking Sequential Behavior

 
Question: Penetration Analysis by Spotify

Task: We’ll revise the original question’s requirements. The new task is to measure how users progress through engagement stages on Spotify. Here are the stages of engagement:

  • Active — User had at least one session
  • Engaged — User had 5+ sessions
  • Power User — User had 5+ sessions and at least 10 listening hours in the past 30 days

For each country, calculate how many users reach each stage and the overall conversion rate from first activity to power user status.

Solution:

WITH base AS (
  SELECT country,
         user_id,
         sessions,
         listening_hours,
         last_active_date
  FROM penetration_analysis
),

stage_1 AS (
  SELECT DISTINCT user_id, country
  FROM base
  WHERE sessions > 0
),

stage_2 AS (
  SELECT DISTINCT user_id, country
  FROM base
  WHERE sessions >= 5
),

stage_3 AS (
  SELECT DISTINCT user_id, country
  FROM base
  WHERE sessions >= 5 AND listening_hours >= 10
)

SELECT country,
       COUNT(DISTINCT s1.user_id) AS users_started,
       COUNT(DISTINCT s2.user_id) AS engaged_5_sessions,
       COUNT(DISTINCT s3.user_id) AS power_users,
       ROUND(100.0 * COUNT(DISTINCT s3.user_id) / NULLIF(COUNT(DISTINCT s1.user_id), 0), 2) AS conversion_rate
FROM stage_1 s1
LEFT JOIN stage_2 s2 USING (user_id, country)
LEFT JOIN stage_3 s3 USING (user_id, country)
GROUP BY country;

 


Pattern: Funnel analysis shows how users move through a series of ordered stages. Because each stage depends on completing the one before it, the analysis focuses on conversion and drop-off.

Here’s the sequence.

 

  1. Define the stages: Identify each step a user must complete
  2. Extract one dataset per stage: Write a common table expression (CTE) or subquery for each stage, containing only the users who qualify
  3. Ensure stage order: If needed, filter by timestamps or sequence rules so that later stages occur after earlier ones
  4. Join the stages: Join the stage datasets using LEFT JOIN to see how many users reach each step
  5. Count the users and calculate conversion rates: Compare the number of users in each stage
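
A compact runnable version of the funnel, in Python's sqlite3 with made-up users (one active-only, one engaged, two power users):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE penetration_analysis
        (country TEXT, user_id INT, sessions INT, listening_hours REAL);
    INSERT INTO penetration_analysis VALUES
        ('US', 1, 1, 2),     -- active only
        ('US', 2, 6, 4),     -- engaged
        ('US', 3, 8, 12),    -- power user
        ('DE', 4, 7, 15);    -- power user
""")

# One CTE per stage; LEFT JOINs keep users who never reach later stages,
# so each COUNT(DISTINCT ...) measures one step of the funnel.
rows = conn.execute("""
    WITH s1 AS (SELECT DISTINCT user_id, country FROM penetration_analysis
                WHERE sessions > 0),
         s2 AS (SELECT DISTINCT user_id, country FROM penetration_analysis
                WHERE sessions >= 5),
         s3 AS (SELECT DISTINCT user_id, country FROM penetration_analysis
                WHERE sessions >= 5 AND listening_hours >= 10)
    SELECT country,
           COUNT(DISTINCT s1.user_id) AS users_started,
           COUNT(DISTINCT s2.user_id) AS engaged_5_sessions,
           COUNT(DISTINCT s3.user_id) AS power_users,
           ROUND(100.0 * COUNT(DISTINCT s3.user_id)
                 / NULLIF(COUNT(DISTINCT s1.user_id), 0), 2) AS conversion_rate
    FROM s1
    LEFT JOIN s2 USING (user_id, country)
    LEFT JOIN s3 USING (user_id, country)
    GROUP BY country
    ORDER BY country
""").fetchall()

print(rows)  # [('DE', 1, 1, 1, 100.0), ('US', 3, 2, 1, 33.33)]
```

NULLIF guards the conversion rate against division by zero for countries with no active users.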

 

// Business Uses

  • E-commerce: Visit -> Add to Cart -> Purchase
  • SaaS: Signup -> Activate -> Retain
  • Streaming: Listen Once -> Engage Regularly -> Become Power User

 

7. Time-Based Comparison: Period-over-Period Metrics

 
Question: Daily Violation Counts from the City of San Francisco

Task: Determine the change in the number of daily violations by calculating the difference between each inspection date's violation count and the previous inspection date's count.

Show the inspection date and the change in the number of daily violations, ordered from the earliest to the latest inspection.

Solution:

SELECT inspection_date::DATE,
       COUNT(violation_id) - LAG(COUNT(violation_id)) OVER(ORDER BY inspection_date::DATE) AS diff
FROM sf_restaurant_health_violations
GROUP BY 1
ORDER BY 1;

 


Pattern: This pattern is useful when you want to see how a metric changes over time.

Here’s the sequence.

 

  1. Aggregate data: Summarize the events into time periods (daily/weekly/monthly totals)
  2. Apply a window function: Use LAG() or LEAD() to access the values from the previous or the following period
  3. Order the time periods: Use the ORDER BY clause in OVER() to sort the data chronologically so comparisons are correct
  4. Calculate the difference: Subtract the prior value from the current value to get the difference
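
The sequence can be run with Python's sqlite3 on invented inspection rows; the daily counts are computed in a subquery here, which is equivalent to applying LAG() directly to COUNT() as in the PostgreSQL solution:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sf_restaurant_health_violations
        (violation_id INT, inspection_date TEXT);
    INSERT INTO sf_restaurant_health_violations VALUES
        (1, '2024-01-01'), (2, '2024-01-01'),                     -- 2 violations
        (3, '2024-01-02'),                                        -- 1 violation
        (4, '2024-01-03'), (5, '2024-01-03'), (6, '2024-01-03');  -- 3 violations
""")

# Aggregate to daily counts, then LAG() fetches the prior day's count;
# the first day has no prior row, so its diff is NULL (None in Python).
rows = conn.execute("""
    SELECT inspection_date,
           n - LAG(n) OVER (ORDER BY inspection_date) AS diff
    FROM (SELECT inspection_date, COUNT(violation_id) AS n
          FROM sf_restaurant_health_violations
          GROUP BY 1) d
    ORDER BY inspection_date
""").fetchall()

print(rows)  # [('2024-01-01', None), ('2024-01-02', -1), ('2024-01-03', 2)]
```

Swapping LAG() for LEAD() would instead compare each day against the following day.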

 

// Business Uses

  • Product: Day-to-day changes in active users
  • Operations: Daily changes in support volume
  • Finance: Month-over-month revenue deltas

 

Wrapping Up

 
Internalize those seven patterns and watch the data analysis problems dissolve before your eyes. I’m sure they will be helpful in many business situations and job interviews.
 
 

Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


Get the FREE ebook 'KDnuggets Artificial Intelligence Pocket Dictionary' along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.

By subscribing you accept KDnuggets Privacy Policy

