Using Pandas and SQL Together for Data Analysis
In this tutorial, we’ll explore when and how SQL functionality can be integrated within the Pandas framework, as well as its limitations.
Image by Author
SQL, or Structured Query Language, has long been the go-to tool for data management, but there are times when it falls short, requiring the power and flexibility of a tool such as Python. Python, a versatile multipurpose programming language, excels at accessing, extracting, wrangling, and exploring data from relational databases. Within Python, the open-source library Pandas is specifically crafted for data manipulation and analysis.
In this tutorial, we’ll explore when and how SQL functionality can be integrated within the Pandas framework, as well as its limitations.
The main question you might wondering right now is...
Why Use Both?
The reason lies in readability and familiarity: in certain cases, especially in complex workflows, SQL queries can be much clearer and easier to read than equivalent Pandas code. This is particularly true for those who started working with data in SQL before transitioning to Pandas.
Moreover, since most data originates from databases, SQL — being the native language of these databases — offers a natural advantage. This is why many data professionals, particularly data scientists, often integrate both SQL and Python (specifically, Pandas) within the same data pipeline to leverage the strengths of each.
To see SQL readability in action, let’s use the following pokemon gen1 pokedex csv file.
Imagine we want to sort the DataFrame by the "Total" column in ascending order and display the top 5. Now we can compare how to perform the same action with both Pandas and SQL.
Using Pandas with Python:
data[["#", "Name", "Total"]].sort_values(by="Total", ascending=True).head(5)
Using SQL:
SELECT
"#",
Name,
Total
FROM data
ORDER BY Total
LIMIT 5
You see how different both are right? But... how can we combine both languages within our working environment with Python?
The solution is using PandaSQL!
Using PandaSQL
Pandas is a powerful open-source data analysis and manipulation python library. PandaSQL allows the use of SQL syntax to query Pandas DataFrames. For people new to Pandas, PandaSQL tries to make data manipulation and cleanup more familiar. You can use PandaSQL to query Pandas DataFrames using SQL syntax.
Let's take a look.
First, we need to install PandaSQL:
pip install pandasql
Then (as always), we import the required packages:
from pandasql import sqldf
Here, we directly imported the sqldf
function from PandaSQL, which is essentially the library's core feature. As the name suggests, sqldf
allows you to query DataFrames using SQL syntax.
sqldf(query_string, env=None)
In this context, query_string
is a required parameter that accepts a SQL query in string format. The env
parameter, optional and seldom used, can be set to either locals()
or globals()
, enabling sqldf
to access variables from the specified scope in your Python environment.
Beyond this function, PandaSQL also includes two basic built-in datasets that can be loaded with the straightforward functions: load_births()
and load_meat()
. This way you have some dummy data to play with built right in.
So now, if we want to execute the previous SQL query within our Python Jupyter notebook, it would be something like the following:
from pandasql import sqldf
import pandas as pd
sqldf('''
SELECT "#", Name, Total
FROM data
ORDER BY Total
LIMIT 5''')
The sqldf
function returns the result of a query as a Pandas DataFrame.
When should we use it
The pandasql library enables data manipulation using SQL's Data Query Language (DQL), providing a familiar, SQL-based approach to interact with data in Pandas DataFrames.
With pandasql, you can execute queries directly on your dataset, allowing for efficient data retrieval, filtering, sorting, grouping, joining, and aggregation.
Additionally, it supports performing mathematical and logical operations, making it a powerful tool for SQL-savvy users working with data in Python.
PandaSQL is limited to SQL’s Data Query Language (DQL) subset, meaning it does not support modifying tables or data—actions like UPDATE
, INSERT
, or DELETE
are not available.
Additionally, since PandaSQL relies on SQL syntax, specifically SQLite, it’s essential to be mindful of SQLite-specific quirks that may affect query behavior.
Comparing PandasSQL and SQL
This section demonstrates how PandaSQL and Pandas can both be used to achieve similar outcomes, offering side-by-side comparisons to highlight their respective approaches.
Generating Multiple Tables
Let's generate subsets of data from a larger dataset, creating tables like types, legendaries, generations, and features. Using PandaSQL, we can specify SQL queries to select specific columns, making it easy to extract the exact data we want.
Using PandaSQL:
types = sqldf('''
SELECT "#", Name, "Type 1", "Type 2"
FROM data''')
legendaries = sqldf('''
SELECT "#", Name, Legendary
FROM data''')
generations = sqldf('''
SELECT "#", Name, Generation
FROM data''')
features = sqldf('''
SELECT "#", Name, Total, HP, Attack, Defense, "Sp. Atk", "Sp. Def","Speed"
FROM data''')
Here, PandaSQL allows for a clean, SQL-based selection syntax that can feel intuitive to users familiar with relational databases. It is particularly useful if data selection involves complex conditions or SQL functions.
Using pure Python:
# Selecting columns for types
types = data[['#', 'Name', 'Type 1', 'Type 2']]
# Selecting columns for legendaries
legendaries = data[['#','Name', 'Legendary']]
# Selecting columns for generations
generations = data[['#','Name', 'Generation']]
# Selecting columns for features
features = data[['#','Name', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
In pure Python, we achieve the same outcome by simply specifying column names within square brackets. While this is efficient for straightforward column selection, it may become less readable with more complex filtering or grouping conditions, where SQL-style syntax can be more natural.
Performing JOINs
Joins are a powerful way to combine data from multiple sources based on common columns, and both PandaSQL and Pandas support this.
First, PandaSQL:
types_features = sqldf('''
SELECT
t1.*,
t2.Total,
t2.HP,
t2.Attack,
t2.Defense,
t2."Sp. Atk",
t2."Sp. Def",
t2."Speed"
FROM types AS t1
LEFT JOIN features AS t2
ON t1."#" = t2."#"
AND t1.Name = t2.Name
''')
Using SQL, this LEFT JOIN combines types and features based on matching values in the # and Name columns. This approach is simple for SQL users, with clear syntax for selecting specific columns and combining data from multiple tables.
In pure Python:
# Performing a left join between `types` and `features` on the columns "#" and "Name"
types_features = types.merge(
features,
on=['#', 'Name'],
how='left'
)
types_features
In pure Python, we accomplish the same result using the merge()
function, specifying on for matching columns and how='left'
to perform a left join. Pandas makes it easy to merge on multiple columns and offers flexibility in specifying join types. However, the SQL-style join syntax can be more readable when working with larger tables or performing more complex joins.
Custom Query
In this example, we retrieve the top 5 records based on "Defense", sorted in descending order.
PandaSQL:
top_5_defense = sqldf('''
SELECT
Name, Defense
FROM features
ORDER BY Defense DESC
LIMIT 5
''')
The SQL query sorts features by the Defense column in descending order and limits the result to the top 5 entries. This approach is direct, especially for SQL users, with the ORDER BY
and LIMIT
keywords making it clear what the query does.
And in pure Python:
top_5_defense = features[['Name', 'Defense']].sort_values(by='Defense', ascending=False).head(5)
Using only Python, we achieve the same result using sort_values()
to order by Defense and then head(5)
to limit the output. Pandas provides a flexible and intuitive syntax for sorting and selecting records, though the SQL approach may still be more familiar to those who regularly work with databases.
Conclusion
In this tutorial, we examined how and when combining SQL functionality with Pandas can help produce cleaner, more efficient code. We covered the setup and use of the PandaSQL library, along with its limitations, and walked through popular examples to compare PandaSQL code with equivalent Pandas Python code.
By comparing these approaches, you can see that PandaSQL is helpful for SQL-native users or scenarios with complex queries, while native Pandas code can be more Pythonic and integrated for those accustomed to working in Python.
You can check all code displayed here in the following Jupyter Notebook
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and is currently working in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the application of the ongoing explosion in the field.