Making Python Speak SQL with pandasql
Want to wrangle Pandas data like you would SQL using Python? This post serves as an introduction to pandasql, and details how to get it up and running inside of Rodeo.
This post originally appeared on the Yhat blog. Yhat is a Brooklyn based company whose goal is to make data science applicable for developers, data scientists, and businesses alike. Yhat provides a software platform for deploying and managing predictive algorithms as REST APIs, while eliminating the painful engineering obstacles associated with production environments like testing, versioning, scaling and security.
One of my favorite things about Python is that users get the benefit of observing the R community and then emulating the best parts of it. I'm a big believer that a language is only as helpful as its libraries and tools.
This post is about
pandasql, a Python package we (Yhat) wrote that emulates the R package
sqldf. It's a small but mighty library comprised of just 358 lines of code. The idea of
pandasql is to make Python speak SQL. For those of you who come from a SQL-first background or still "think in SQL",
pandasql is a nice way to take advantage of the strengths of both languages.
In this introduction, we'll show you to get up and running with
pandasql inside of Rodeo, the integrated development environment (IDE) we built for data exploration and analysis. Rodeo is an open source and completely free tool. If you're an R user, its a comparable tool with a similar feel to RStudio. As of today, Rodeo can only run Python code, but last week we added syntax highlighting for a bunch of other languages to the editor (markdown, JSON, julia, SQL, markdown). As you may have read or guessed, we've got big plans for Rodeo, including adding SQL support so that you can run your SQL queries right inside of Rodeo, even without our handy little
pandasql. More on that in the next week or two!
Start by downloading Rodeo for Mac, Windows or Linux from the Rodeo page on the Yhat website.
ps If you download Rodeo and encounter a problem or simply have a question, we monitor our discourse forum 24/7 (okay, almost).
A bit of background, if you're curious
Behind the scenes,
pandasql uses the
pandas.io.sql module to transfer data between
DataFrame and SQLite databases. Operations are performed in SQL, the results returned, and the database is then torn down. The library makes heavy use of
frame_query, two functions which let you read and write to/from
pandas and (most) any SQL database.
pandasql using the package manager pane in Rodeo. Simply search for
pandasql and click Install Package.
You can also run
! pip install pandasql from the text editor if you prefer to install that way.
Check out the datasets
pandasql has two built-in datasets which we'll use for the examples below.
meat: Dataset from the U.S. Dept. of Agriculture containing metrics on livestock, dairy, and poultry outlook and production
births: Dataset from the United Nations Statistics Division containing demographic statistics on live births by month
Run the following code to check out the data sets.
Inside Rodeo, you really don't even need the print.variable.head() statements, since you can actually just examine the dataframes directly.
An odd graph
Notice that the plot appears both in the console and the plot tab (bottom right tab).
Tip: You can "pop out" your plot by clicking the arrows at the top of the pane. This is handy if you're working on multiple monitors and want to dedicate one just to your data visualzations.
To keep this post concise and easy to read, we've just given the code snippets and a few lines of results for most of the queries below.
If you're following along in Rodeo, a few tips as you're getting started:
Run Scriptwill indeed run everything you have written in the text editor
- You can highlight a code chunk and run it by clicking
Run Lineor pressing Command + Enter
- You can resize the panes (when I'm not making plots I shrink down the bottom right pane)
Write some SQL and execute it against your
DataFrame by substituting DataFrames for tables.
pandasql creates a DB, schema and all, loads your data, and runs your SQL.
pandasql supports aggregation. You can use aliased column names or column numbers in your
group by clause.
pandasql needs to have access to other variables in your session/environment. You can pass
pandasql when executing a SQL statement, but if you're running a lot of queries that might be a pain. To avoid passing locals all the time, you can add this helper function to your script to set
globals() like so:
You can join dataframes using normal SQL syntax.
It's just SQL
pandasql is powered by SQLite3, you can do most anything you can do in SQL. Here are some examples using common SQL features such as subqueries, order by, functions, and unions.
pandas is an incredible tool for data analysis in large part, we think, because it is extremely digestible, succinct, and expressive. Ultimately, there are tons of reasons to learn the nuances of
melt and other native
pandas features for slicing and dicing data. Check out the docs for some examples.
Our hope is that
pandasql will be a helpful learning tool for folks new to Python and
pandas. In my own personal experience learning R,
sqldf was a familiar interface helping me become highly productive with a new tool as quickly as possible.
We hope you'll check out
pandasql and Rodeo; if you do, please let us know what you think!
Original. Reposted with permission.