Follow Gregory Piatetsky, No. 1 on LinkedIn Top Voices in Data Science & Analytics

KDnuggets Home » News » 2015 » Mar » Opinions, Interviews, Reports » The Grammar of Data Science: Python vs R ( 15:n10 )

The Grammar of Data Science: Python vs R


 
  http likes 210

In this post, I will elaborate on my experience switching teams by comparing and contrasting R and Python solutions to some simple data exploration exercises.



By Deep Ganguli (Stitchfix).

Python and R are popular programming languages used by data scientists. Until recently, I exclusively used Python for exploratory data analysis, relying on Pandas and Seaborn for data manipulation and visualization. However, after seeing my colleagues do some amazing work in R with dplyr and ggplot2, I decided to take the plunge and learn how the other side lives. I found that I could more easily translate my ideas into code and beautiful visualizations with R than with Python. In this post, I will elaborate on my experience switching teams by comparing and contrasting R and Python solutions to some simple data exploration exercises.

Entering the Hadleyverse

My data exploration process has the following steps: hypothesize, get data, sanitize the data, compute descriptive statistics, plot things, drill down, rinse and repeat. I will use the (in)famous diamonds dataset that ships with ggplot2 to illustrate this process in R and Python. The dataset contains the carat size, cut, clarity, depth, table, price, and dimensional measurements of around 50 thousand diamonds. Here are the first five rows of data:

five rows of data


Some questions that come to mind are: How does the price of a diamond vary with the carat size? How does this relationship vary with categorical variables like cut, color, or clarity? I can answer all these questions with a quick visualization, using ggplot, in five lines of code:

library(ggplot2)
library(dplyr)
data(diamonds)
diamonds %>% 
  ggplot(aes(x=carat,y=price)) + 
  geom_point(alpha=0.5) +
  facet_grid(~ cut) + 
  stat_smooth(method = lm, formula = y ~ poly(x,2)) + 
  theme_bw()

The %>% operator (gross looking!) is a pipe which simply passes the output of the left operator as the first argument to the right operator. I tell ggplot with the aesthetic mapping (aes) that my independent variable x is carat, and my dependent variable y is price. The next graphical layer (geom_point) states that I want to represent each data point as a point, to eventually form a scatter plot. Next, I facet the scatter plot (facet_grid) on cut, although I can also facet on color or clarity. Finally, I have ggplot fit a second order linear model (stat_smooth) to the data, and display the fit on top of the scatterplot. I find the language for expressing the visualization to be intuitive, and the result is beautiful:

Example of R ggpairs

With this simple visualization, we can quickly see that price increases with carat size, the relationship is nonlinear, there are some outliers, and the relationship does not depend too heavily on cut.

Lost in translation with Python

How can we do the same thing in Python? With Seaborn, this can be done with lmplot,

from ggplot.exampledata import diamonds
import seaborn as sns

sns.set_style("white")
sns.lmplot("carat", "price", col="cut", data=diamonds, order=2)

which outputs a markedly less beautiful visualization:

Example of seaborn lmplot


Sign Up