The Grammar of Data Science: Python vs R
Python and R are popular programming languages used by data scientists. Until recently, I exclusively used Python for exploratory data analysis, relying on Pandas and Seaborn for data manipulation and visualization. However, after seeing my colleagues do some amazing work in R with dplyr and ggplot2, I decided to take the plunge and learn how the other side lives. I found that I could more easily translate my ideas into code and beautiful visualizations with R than with Python. In this post, I will elaborate on my experience switching teams by comparing and contrasting R and Python solutions to some simple data exploration exercises.
Entering the Hadleyverse
My data exploration process has the following steps: hypothesize, get data, sanitize the data, compute descriptive statistics, plot things, drill down, rinse and repeat. I will use the (in)famous diamonds dataset that ships with ggplot2 to illustrate this process in R and Python. The dataset contains the carat size, cut, clarity, depth, table, price, and dimensional measurements of around 50 thousand diamonds. Here are the first five rows of data:
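As a minimal sketch of the "get data" and "peek at it" steps in Python with pandas (the full ~54,000-row dataset ships with ggplot2 in R; the inline rows below are just a small sample with the same schema so the snippet is self-contained):

```python
import pandas as pd

# A small stand-in with the diamonds schema:
# carat, cut, color, clarity, depth, table, price, x, y, z.
# In practice you would load the full dataset, e.g. from a CSV export.
diamonds = pd.DataFrame({
    "carat":   [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut":     ["Ideal", "Premium", "Good", "Premium", "Good"],
    "color":   ["E", "E", "E", "I", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "depth":   [61.5, 59.8, 56.9, 62.4, 63.3],
    "table":   [55.0, 61.0, 65.0, 58.0, 58.0],
    "price":   [326, 326, 327, 334, 335],
    "x":       [3.95, 3.89, 4.05, 4.20, 4.34],
    "y":       [3.98, 3.84, 4.07, 4.23, 4.35],
    "z":       [2.43, 2.31, 2.31, 2.63, 2.75],
})

# Show the first five rows, mirroring head(diamonds) in R.
print(diamonds.head())
```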
Some questions that come to mind are: How does the price of a diamond vary with the carat size? How does this relationship vary with categorical variables like cut, color, or clarity? I can answer all these questions with a quick visualization, using ggplot, in five lines of code:
library(ggplot2)
library(dplyr)
data(diamonds)
diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(alpha = 0.5) +
  facet_grid(~ cut) +
  stat_smooth(method = lm, formula = y ~ poly(x, 2)) +
  theme_bw()
The %>% operator (gross looking!) is a pipe that simply passes the output of the expression on its left as the first argument to the function on its right. I tell ggplot with the aesthetic mapping (aes) that my independent variable x is carat, and my dependent variable y is price. The next graphical layer (geom_point) states that I want to represent each data point as a point, to eventually form a scatter plot. Next, I facet the scatter plot (facet_grid) on cut, although I could also facet on color or clarity. Finally, I have ggplot fit a second-order linear model (stat_smooth) to the data and display the fit on top of the scatter plot. I find the language for expressing the visualization intuitive, and the result is beautiful:
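For readers coming from the Python side, the second-order fit that stat_smooth performs per facet can be sketched with numpy: fit a degree-2 polynomial of carat to price separately within each cut. The data below is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the diamonds columns used in the plot.
carat = rng.uniform(0.2, 3.0, size=200)
cut = rng.choice(["Fair", "Good", "Ideal"], size=200)
# Price grows roughly quadratically with carat, plus noise.
price = 500 + 2000 * carat + 1500 * carat**2 + rng.normal(0, 300, size=200)

# One quadratic fit per facet level, like stat_smooth within facet_grid.
fits = {}
for level in np.unique(cut):
    mask = cut == level
    # polyfit returns coefficients highest degree first: [a2, a1, a0].
    fits[level] = np.polyfit(carat[mask], price[mask], deg=2)

for level, (a2, a1, a0) in fits.items():
    print(f"{level}: price ~ {a2:.0f}*carat^2 + {a1:.0f}*carat + {a0:.0f}")
```

This is only the model-fitting half of what the single stat_smooth layer gives you; drawing the fitted curves and confidence bands on each panel is extra work in Python, which is part of the post's point.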
With this simple visualization, we can quickly see that price increases with carat size, the relationship is nonlinear, there are some outliers, and the relationship does not depend too heavily on cut.
Lost in translation with Python
How can we do the same thing in Python? With Seaborn, this can be done with lmplot,
from ggplot.exampledata import diamonds
import seaborn as sns

sns.set_style("white")
sns.lmplot(x="carat", y="price", col="cut", data=diamonds, order=2)
which outputs a markedly less beautiful visualization: