Building Your Modern Data Analytics Stack with Python, Parquet, and DuckDB

Modern data analytics doesn’t have to be complex. Learn how Python, Parquet, and DuckDB work together in practice.




 

Introduction

 
Data analytics has changed in recent years. The traditional approach of loading everything into a relational database and running SQL queries still works, but it is overkill for many analytical workloads. For those workloads, storing data in Parquet files and querying them directly with DuckDB is faster and simpler.

In this article, I will show you how to build a data analytics stack in Python that uses DuckDB to query data stored in Parquet files. We will work with a sample dataset, explore how each component works, and understand why this approach can be useful for your data science projects.

You can find the code on GitHub.

 

Prerequisites

 
Before we start, make sure you have:

  • Python 3.10 or a later version installed
  • An understanding of SQL basics and pandas DataFrame operations
  • Familiarity with data analysis concepts

Also install the required libraries:

pip install duckdb pandas pyarrow numpy faker

 

Understanding the Recommended Data Analytics Stack

 
Let's start by understanding what each component does and why they work well together.

Parquet is a columnar storage format originally created for the Hadoop ecosystem. Unlike row-based formats like CSV where each line is a complete record, Parquet organizes data by columns. This might seem like a small difference, but it has huge implications for analytics.

When you run a query that only needs three columns from a table with fifty columns, Parquet lets you read only those three columns. With CSV, you have to read every row completely and then throw away the 47 columns you do not need. This makes Parquet faster for typical analytical queries. Additionally, columnar storage compresses well because values in the same column tend to be similar.
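To make this concrete, here is a minimal sketch of column pruning from Python. It assumes you already have a Parquet file with (at least) the two columns referenced below, such as the customers.parquet file we create later in this article; pandas, through pyarrow, reads only those columns from disk.

import pandas as pd

# Read only the columns the analysis needs; the other column chunks on disk
# are never decoded or loaded into memory
cols = ["customer_id", "customer_segment"]
customers = pd.read_parquet("customers.parquet", columns=cols)

# The CSV equivalent must still read every row of the file from disk,
# even though usecols discards the unwanted columns afterwards
# customers = pd.read_csv("customers.csv", usecols=cols)

print(customers.head())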

DuckDB is an embedded analytical database. While SQLite is optimized for transactional workloads with many small reads and writes, DuckDB is built for analytical queries: large scans, aggregations, and joins. "Embedded" means it runs inside your Python process, so there is no separate database server to install or manage.

What makes DuckDB special for analytics is that it can query Parquet files directly. You do not need to import data into the database first. Point DuckDB at a Parquet file, write SQL, and it reads only what it needs. This "query in place" capability is what makes the whole stack useful.

All of this fits naturally into a Python workflow: Parquet files store the data, pandas handles data manipulation, DuckDB executes the analytical queries, and the rest of the Python ecosystem is available for visualization, machine learning, and automation.
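Here is a minimal sketch of that loop, with a throwaway DataFrame standing in for real data (the column names are purely illustrative):

import duckdb
import pandas as pd

# pandas creates (or manipulates) the data and writes it to Parquet
df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "sales": [120, 80, 95]})
df.to_parquet("sales.parquet")

# DuckDB queries the file in place and hands the result back as a DataFrame
result = duckdb.sql(
    "SELECT city, SUM(sales) AS total_sales FROM 'sales.parquet' GROUP BY city"
).df()
print(result)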

 

Creating a Sample Dataset

 
We will use an e-commerce dataset. You can use the data_generator.py script to generate the sample dataset or follow this notebook.

The dataset includes customers who place orders, orders that contain multiple items, and products with categories and pricing.

The data has referential integrity. Every order references a valid customer, and every order item references both a valid order and product. This lets us perform meaningful joins and aggregations.
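If you prefer to build the data inline, the sketch below shows the general shape of the generation logic for two of the four tables. It is a simplified stand-in for data_generator.py, limited to columns that the queries later in this article use; the actual script on GitHub is more complete.

import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(42)

n_customers, n_orders = 1_000, 5_000

# Customers with a segment label, roughly matching the distribution used later
customers_df = pd.DataFrame({
    "customer_id": range(1, n_customers + 1),
    "customer_name": [fake.name() for _ in range(n_customers)],
    "customer_segment": rng.choice(["Basic", "Standard", "Premium"],
                                   size=n_customers, p=[0.3, 0.5, 0.2]),
})

# Orders that reference valid customer IDs, preserving referential integrity
orders_df = pd.DataFrame({
    "order_id": range(1, n_orders + 1),
    "customer_id": rng.integers(1, n_customers + 1, size=n_orders),
    "order_date": [fake.date_between("-1y", "today") for _ in range(n_orders)],
    "order_total": rng.uniform(20, 2000, size=n_orders).round(2),
    "payment_status": rng.choice(["completed", "pending", "refunded"],
                                 size=n_orders, p=[0.85, 0.1, 0.05]),
})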

 

Saving Data to a Parquet File

 
Before we save our data, let's understand why Parquet is effective for analytics. We have already touched on the advantages of columnar storage; now let's go over them in greater detail.

In a CSV file, data is stored row by row. If you have a million rows with 50 columns each, and you want to analyze just one column, you still need to read all 50 million values to skip past the columns you do not need. This is wasteful.

Parquet, as we now know, stores data column by column. All values for one column are stored together. When you query one column, you read exactly that column and nothing else. For analytical queries that typically touch a small number of columns, this is much faster.

Columnar storage also compresses better. Values in the same column tend to be similar — they are usually all integers, all dates, or all drawn from the same small set of categories. Compression algorithms work much better on similar data than on random data.

Let's save our data as Parquet and see the benefits:

# Save tables as Parquet files
customers_df.to_parquet('customers.parquet', engine='pyarrow', compression='snappy')
products_df.to_parquet('products.parquet', engine='pyarrow', compression='snappy')
orders_df.to_parquet('orders.parquet', engine='pyarrow', compression='snappy')
order_items_df.to_parquet('order_items.parquet', engine='pyarrow', compression='snappy')

# Compare with CSV to see the difference
customers_df.to_csv('customers.csv', index=False)
orders_df.to_csv('orders.csv', index=False)

import os

def get_size_mb(filename):
    return os.path.getsize(filename) / (1024 * 1024)

print("Storage Comparison:")
print(f"customers.csv:     {get_size_mb('customers.csv'):.2f} MB")
print(f"customers.parquet: {get_size_mb('customers.parquet'):.2f} MB")
print(f"Savings: {(1 - get_size_mb('customers.parquet')/get_size_mb('customers.csv'))*100:.1f}%\n")

print(f"orders.csv:        {get_size_mb('orders.csv'):.2f} MB")
print(f"orders.parquet:    {get_size_mb('orders.parquet'):.2f} MB")
print(f"Savings: {(1 - get_size_mb('orders.parquet')/get_size_mb('orders.csv'))*100:.1f}%")

 
Output:

Storage Comparison:
customers.csv:     0.73 MB
customers.parquet: 0.38 MB
Savings: 48.5%

orders.csv:        3.01 MB
orders.parquet:    1.25 MB
Savings: 58.5%

 

These compression ratios are typical: Parquet files routinely come out at a fraction of the size of the equivalent CSV. The codec used here is Snappy, which prioritizes speed over maximum compression.

Note: Parquet supports other codecs like Gzip, which offers better compression but is slower, and Zstd for a good balance between compression and speed.
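If you want to see the trade-off on your own data, you can write the same DataFrame with each codec and compare file sizes (the exact ratios depend on the data):

import os

# Write the orders table once per codec and report the resulting file size
for codec in ["snappy", "gzip", "zstd"]:
    path = f"orders_{codec}.parquet"
    orders_df.to_parquet(path, engine="pyarrow", compression=codec)
    print(f"{codec:>7}: {os.path.getsize(path) / (1024 * 1024):.2f} MB")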

 

Querying Parquet Files with DuckDB

 
Now comes the interesting part. We can query these Parquet files directly using SQL without loading them into a database first.

import duckdb

# Create a DuckDB connection
con = duckdb.connect(database=':memory:')

# Query the Parquet file directly
query = """
SELECT
    customer_segment,
    COUNT(*) as num_customers,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as percentage
FROM 'customers.parquet'
GROUP BY customer_segment
ORDER BY num_customers DESC
"""

result = con.execute(query).fetchdf()
print("Customer Distribution:")
print(result)

 

Output:

Customer Distribution:
  customer_segment  num_customers  percentage
0          Standard           5070       50.70
1             Basic           2887       28.87
2           Premium           2043       20.43

 

Look at the query syntax: FROM 'customers.parquet'. DuckDB reads the file directly. There is no import step, no CREATE TABLE statement, and no waiting for data to load. You write SQL, DuckDB figures out what data it needs from the file, and returns results.

In traditional workflows, you would need to create a database, define schemas, import data, create indexes, and only then query. With DuckDB and Parquet, you skip all of that. Under the hood, DuckDB reads the Parquet file's metadata to understand the schema, uses predicate pushdown to skip row groups that cannot match your WHERE clause, and reads only the columns your query actually uses. For large files, this can make queries dramatically faster than a full scan.
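You can peek at this behavior yourself with EXPLAIN, which prints the physical plan DuckDB builds for a query. In the Parquet scan operator you should see the projected columns and the filter pushed into the scan (the exact operator names vary between DuckDB versions):

# Print the physical plan for a filtered query against the Parquet file
plan = con.execute("""
    EXPLAIN
    SELECT customer_id, order_total
    FROM 'orders.parquet'
    WHERE payment_status = 'completed'
""").fetchall()

# EXPLAIN returns (key, value) pairs; the value holds the formatted plan text
for _, plan_text in plan:
    print(plan_text)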

 

Performing Complex Analytics

 
Let's run a slightly more complex analytical query. We will analyze monthly revenue trends broken down by customer segment.

query = """
SELECT
    strftime(o.order_date, '%Y-%m') as month,
    c.customer_segment,
    COUNT(DISTINCT o.order_id) as num_orders,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    ROUND(SUM(o.order_total), 2) as total_revenue,
    ROUND(AVG(o.order_total), 2) as avg_order_value
FROM 'orders.parquet' AS o
JOIN 'customers.parquet' AS c
  ON o.customer_id = c.customer_id
WHERE o.payment_status = 'completed'
GROUP BY month, c.customer_segment
ORDER BY month DESC, total_revenue DESC
LIMIT 15
"""

monthly_revenue = con.execute(query).fetchdf()
print("Recent Monthly Revenue by Segment:")
print(monthly_revenue.to_string(index=False))

 
Output:

Recent Monthly Revenue by Segment:
  month customer_segment  num_orders  unique_customers  total_revenue  avg_order_value
2026-01          Standard        2600              1468     1683223.68           647.39
2026-01             Basic        1585               857     1031126.44           650.55
2026-01           Premium         970               560      914105.61           942.38
2025-12          Standard        2254              1571     1533076.22           680.16
2025-12           Premium         885               613      921775.85          1041.55
2025-12             Basic        1297               876      889270.86           685.64
2025-11          Standard        1795              1359     1241006.08           691.37
2025-11           Premium         725               554      717625.75           989.83
2025-11             Basic        1012               767      682270.44           674.18
2025-10          Standard        1646              1296     1118400.61           679.47
2025-10           Premium         702               550      695913.24           991.33
2025-10             Basic         988               769      688428.86           696.79
2025-09          Standard        1446              1181      970017.17           670.83
2025-09           Premium         594               485      577486.81           972.20
2025-09             Basic         750               618      495726.69           660.97

 

This query groups by two dimensions (month and segment), aggregates multiple metrics, and filters on payment status. It is the kind of query you write constantly in analytical work. The strftime function formats dates directly in SQL, and ROUND cleans up the decimal places. All of the aggregates are computed in a single pass over the data.
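Because the result comes back as an ordinary DataFrame, you can keep working on it with pandas. For example, pivoting it into one revenue column per segment sets it up nicely for plotting:

# Reshape the long result into one total_revenue column per customer segment
pivot = monthly_revenue.pivot(index="month",
                              columns="customer_segment",
                              values="total_revenue").sort_index()
print(pivot)

# With matplotlib installed, a trend chart is one line away:
# pivot.plot(title="Monthly revenue by segment")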

 

Joining Multiple Tables

 
Real analytics rarely involves a single table. Let's join our tables to answer a business question: which product categories generate the most revenue, and how does this vary by customer segment?

query = """
SELECT
    p.category,
    c.customer_segment,
    COUNT(DISTINCT oi.order_id) as num_orders,
    SUM(oi.quantity) as units_sold,
    ROUND(SUM(oi.item_total), 2) as total_revenue,
    ROUND(AVG(oi.item_total), 2) as avg_item_value
FROM 'order_items.parquet' oi
JOIN 'orders.parquet' o ON oi.order_id = o.order_id
JOIN 'products.parquet' p ON oi.product_id = p.product_id
JOIN 'customers.parquet' c ON o.customer_id = c.customer_id
WHERE o.payment_status = 'completed'
GROUP BY p.category, c.customer_segment
ORDER BY total_revenue DESC
LIMIT 20
"""

category_analysis = con.execute(query).fetchdf()
print("Revenue by Category and Customer Segment:")
print(category_analysis.to_string(index=False))

 
Truncated output:

Revenue by Category and Customer Segment:
     category customer_segment  num_orders  units_sold  total_revenue  avg_item_value
  Electronics          Standard        4729      6431.0     6638814.75         1299.18
  Electronics           Premium        2597      3723.0     3816429.62         1292.39
  Electronics             Basic        2685      3566.0     3585652.92         1240.28
   Automotive          Standard        4506      5926.0     3050679.12          633.18
       Sports          Standard        5049      6898.0     2745487.54          497.55
...
...
     Clothing           Premium        3028      4342.0      400704.25          114.55
     Clothing             Basic        3102      4285.0      400391.18          117.49
        Books          Standard        6196      8511.0      252357.39           36.74

 

This query joins four tables. DuckDB automatically determines the join order and execution strategy. Notice how readable the SQL is compared to equivalent pandas code. For complex analytical logic, SQL often expresses intent more clearly than chained DataFrame operations.
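A nice side effect of DuckDB running inside your Python process is that it can also join Parquet files against in-memory pandas DataFrames in the same query. The sketch below registers a small, made-up lookup table of revenue targets and joins it into the segment totals; the table name and the target numbers are purely illustrative.

import pandas as pd

# A small lookup table defined directly in pandas (illustrative numbers)
segment_targets = pd.DataFrame({
    "customer_segment": ["Basic", "Standard", "Premium"],
    "revenue_target": [500_000, 1_500_000, 900_000],
})
con.register("segment_targets", segment_targets)

# Parquet files and the registered DataFrame joined in one query
print(con.execute("""
    SELECT c.customer_segment,
           t.revenue_target,
           ROUND(SUM(o.order_total), 2) AS revenue
    FROM 'orders.parquet' o
    JOIN 'customers.parquet' c ON o.customer_id = c.customer_id
    JOIN segment_targets t ON c.customer_segment = t.customer_segment
    WHERE o.payment_status = 'completed'
    GROUP BY c.customer_segment, t.revenue_target
""").fetchdf())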

 

Understanding Query Performance

 
Let's compare DuckDB with pandas for a common analytical task.

 

Method 1: Using pandas

import time

# Analytical task: Calculate customer purchase patterns
print("Performance Comparison: Customer Purchase Analysis\n")

start_time = time.time()

# Merge dataframes
merged = order_items_df.merge(orders_df, on='order_id')
merged = merged.merge(products_df, on='product_id')

# Filter completed orders
completed = merged[merged['payment_status'] == 'completed']

# Group and aggregate
customer_patterns = completed.groupby('customer_id').agg({
    'order_id': 'nunique',
    'product_id': 'nunique',
    'item_total': ['sum', 'mean'],
    'category': lambda x: x.mode()[0] if len(x) > 0 else None
})

customer_patterns.columns = ['num_orders', 'unique_products', 'total_spent', 'avg_spent', 'favorite_category']
customer_patterns = customer_patterns.sort_values('total_spent', ascending=False).head(100)

pandas_time = time.time() - start_time

 

Method 2: Using DuckDB

start_time = time.time()

query = """
SELECT
    o.customer_id,
    COUNT(DISTINCT oi.order_id) as num_orders,
    COUNT(DISTINCT oi.product_id) as unique_products,
    ROUND(SUM(oi.item_total), 2) as total_spent,
    ROUND(AVG(oi.item_total), 2) as avg_spent,
    MODE(p.category) as favorite_category
FROM 'order_items.parquet' oi
JOIN 'orders.parquet' o ON oi.order_id = o.order_id
JOIN 'products.parquet' p ON oi.product_id = p.product_id
WHERE o.payment_status = 'completed'
GROUP BY o.customer_id
ORDER BY total_spent DESC
LIMIT 100
"""

duckdb_result = con.execute(query).fetchdf()
duckdb_time = time.time() - start_time

print(f"Pandas execution time:  {pandas_time:.4f} seconds")
print(f"DuckDB execution time:  {duckdb_time:.4f} seconds")
print(f"Speedup: {pandas_time/duckdb_time:.1f}x faster with DuckDB\n")

print("Top 5 customers by total spent:")
print(duckdb_result.head().to_string(index=False))

 

Output:

Performance Comparison: Customer Purchase Analysis

Pandas execution time:  1.9872 seconds
DuckDB execution time:  0.1171 seconds
Speedup: 17.0x faster with DuckDB

Top 5 customers by total spent:
 customer_id  num_orders  unique_products  total_spent  avg_spent favorite_category
        8747           8               24     21103.21     879.30       Electronics
         617           9               27     19596.22     725.79       Electronics
        2579           9               18     17011.30     895.33            Sports
        6242           7               23     16781.11     729.61       Electronics
        5443           8               22     16697.02     758.96        Automotive

 

DuckDB is about 17x faster here, and the gap typically widens as datasets grow. The pandas approach works on full in-memory DataFrames, performs multiple merge operations (each of which creates a copy), and then aggregates. DuckDB reads directly from the Parquet files, pushes filters down to avoid reading unnecessary data, and uses optimized join algorithms.
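One practical note: if you plan to run many queries over the same files, you can load them once into DuckDB tables (in the in-memory database used here, or in a persistent .duckdb file) so that later queries skip the Parquet decoding step. A minimal sketch:

# Materialize the Parquet files as DuckDB tables once...
con.execute("CREATE OR REPLACE TABLE orders AS SELECT * FROM 'orders.parquet'")
con.execute("CREATE OR REPLACE TABLE order_items AS SELECT * FROM 'order_items.parquet'")

# ...then subsequent queries reference the tables instead of the files
print(con.execute("""
    SELECT payment_status, COUNT(*) AS num_orders
    FROM orders
    GROUP BY payment_status
    ORDER BY num_orders DESC
""").fetchdf())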

 

Building Reusable Analytics Queries

 
In production analytics, you will run similar queries repeatedly with different parameters. Let's build a reusable function that follows best practices for this workflow.

def analyze_product_performance(con, category=None, min_revenue=None, date_from=None, top_n=20):
    """
    Analyze product performance with flexible filtering.

    This demonstrates how to build reusable analytical queries that can be
    parameterized for different use cases. In production, you'd build a library
    of these functions for common analytical questions.
    """

    # Build the WHERE clause dynamically based on parameters
    where_clauses = ["o.payment_status = 'completed'"]

    if category:
        where_clauses.append(f"p.category = '{category}'")

    if date_from:
        where_clauses.append(f"o.order_date >= '{date_from}'")

    where_clause = " AND ".join(where_clauses)

    # Main analytical query
    query = f"""
    WITH product_metrics AS (
        SELECT
            p.product_id,
            p.product_name,
            p.category,
            p.base_price,
            COUNT(DISTINCT oi.order_id) as times_ordered,
            SUM(oi.quantity) as units_sold,
            ROUND(SUM(oi.item_total), 2) as total_revenue,
            ROUND(AVG(oi.unit_price), 2) as avg_selling_price,
            ROUND(SUM(oi.item_total) - (p.cost * SUM(oi.quantity)), 2) as profit
        FROM 'order_items.parquet' oi
        JOIN 'orders.parquet' o ON oi.order_id = o.order_id
        JOIN 'products.parquet' p ON oi.product_id = p.product_id
        WHERE {where_clause}
        GROUP BY p.product_id, p.product_name, p.category, p.base_price, p.cost
    )
    SELECT
        *,
        ROUND(100.0 * profit / total_revenue, 2) as profit_margin_pct,
        ROUND(avg_selling_price / base_price, 2) as price_realization
    FROM product_metrics
    """

    # Add revenue filter if specified
    if min_revenue:
        query += f" WHERE total_revenue >= {min_revenue}"

    query += f"""
    ORDER BY total_revenue DESC
    LIMIT {top_n}
    """

    return con.execute(query).fetchdf()

 

This function does three things. First, it builds the SQL dynamically based on its parameters, allowing flexible filtering without writing a separate query for each case. Second, it uses a Common Table Expression (CTE) to organize complex logic into readable steps. Third, it calculates derived metrics like profit margin and price realization that require multiple source columns.

The profit calculation subtracts costs from revenue using data from both the order items and products tables. This kind of cross-table calculation is straightforward in SQL but would be cumbersome with multiple pandas operations. DuckDB handles it efficiently in a single query.

Here is an example that uses the function above:

# Example 1: Top electronics products
electronics = analyze_product_performance(con, category='Electronics', top_n=10)
print("Top 10 Electronics Products:")
print(electronics[['product_name', 'units_sold', 'total_revenue', 'profit_margin_pct']].to_string(index=False))

 

Output:

Top 10 Electronics Products:
        product_name  units_sold  total_revenue  profit_margin_pct
Electronics Item 113       262.0      510331.81              38.57
Electronics Item 154       289.0      486307.74              38.28
Electronics Item 122       229.0      448680.64              38.88
Electronics Item 472       251.0      444680.20              38.51
Electronics Item 368       222.0      424057.14              38.96
Electronics Item 241       219.0      407648.10              38.75
Electronics Item 410       243.0      400078.65              38.31
Electronics Item 104       233.0      400036.84              38.73
  Electronics Item 2       213.0      382583.85              38.76
Electronics Item 341       240.0      376722.94              38.94

 

And here is another example:

# Example 2: High-revenue products across all categories
print("\n\nHigh-Revenue Products (>$50k revenue):")
high_revenue = analyze_product_performance(con, min_revenue=50000, top_n=10)
print(high_revenue[['product_name', 'category', 'total_revenue', 'profit']].to_string(index=False))

 

Output:

High-Revenue Products (>$50k revenue):
        product_name     category  total_revenue     profit
Electronics Item 113 Electronics      510331.81  196846.19
Electronics Item 154 Electronics      486307.74  186140.78
Electronics Item 122 Electronics      448680.64  174439.40
Electronics Item 472 Electronics      444680.20  171240.80
Electronics Item 368 Electronics      424057.14  165194.04
Electronics Item 241 Electronics      407648.10  157955.25
Electronics Item 410 Electronics      400078.65  153270.84
Electronics Item 104 Electronics      400036.84  154953.46
  Electronics Item 2 Electronics      382583.85  148305.15
Electronics Item 341 Electronics      376722.94  146682.94

 

Wrapping Up

 
In this article, we analyzed e-commerce data. We generated relational data, stored it as Parquet, and queried it with DuckDB. The performance comparisons showed substantial speedups compared to traditional pandas approaches.

Use this stack when you are performing analytical workloads on structured data: aggregating, filtering, joining, and computing metrics. It works well for data that changes in batches rather than continuously. If you are analyzing yesterday's sales, processing monthly reports, or exploring historical trends, Parquet files updated periodically work great; you do not need a live database constantly accepting writes.

However, this stack is not suitable for everything:

  • If you need real-time updates with many concurrent writers, you need a traditional database with ACID transactions
  • If you are building an application with user-facing queries that need millisecond response times, an indexed database is better
  • If multiple users need to query simultaneously with different access permissions, a database server provides better control

The sweet spot is analytical work on large datasets, where data updates happen in batches and you need fast, flexible querying and analysis.

Happy analyzing!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

