R vs Python: head-to-head data analysis

The epic battle between R and Python goes on. Here we compare the two on the generic tasks of a data scientist's workflow: reading a CSV, summarizing data, PCA, model building, plotting, and more.

Make clusters of the players

One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.

R

library(cluster)
set.seed(1)
# Keep a column only if it's numeric and has no missing values
isGoodCol <- function(col){
  sum(is.na(col)) == 0 && is.numeric(col)
}
goodCols <- sapply(nba, isGoodCol)
# Find 5 clusters with k-means on the usable columns
clusters <- kmeans(nba[, goodCols], centers=5)
labels <- clusters$cluster

Python

from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=5, random_state=1)
# Keep only numeric columns, then drop any column with missing values
good_columns = nba.select_dtypes(include="number").dropna(axis=1)
kmeans_model.fit(good_columns)
labels = kmeans_model.labels_

In order to cluster properly, we remove any non-numeric columns and any columns with missing values (NA, NaN, etc.). In R, we do this by applying a function across each column, dropping the column if it has any missing values or isn't numeric. We then use the built-in kmeans function to find 5 clusters in our data (the cluster library loaded above comes into play for plotting in the next step). We set a random seed using set.seed so the results are reproducible.

In Python, we use scikit-learn, the main Python machine learning package, to fit a k-means clustering model and get our cluster labels. We prepare the data much as we did in R, except we use pandas' select_dtypes and dropna methods to remove non-numeric columns and columns with missing values.
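
A quick sanity check on the result is to look at how many players landed in each cluster; a minimal sketch, reusing the labels from the Python snippet above:

import pandas as pd
# Count how many players fall into each of the 5 clusters
print(pd.Series(labels).value_counts())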

Plot players by cluster

We can now plot the players by cluster to discover patterns. One way to do this is to first use PCA to reduce the data to two dimensions, then plot it, shading each point according to its cluster.

R

# Project the data onto its principal components
nba2d <- prcomp(nba[, goodCols], center=TRUE)
# Keep the first two components and plot, shaded by cluster
twoColumns <- nba2d$x[,1:2]
clusplot(twoColumns, labels)

Python

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Reduce to two principal components for plotting
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
plt.show()

We made a scatter plot of our data and shaded each point according to its cluster. In R, we used the clusplot function, which is part of the cluster library, and performed PCA via the prcomp function that's built into R.

With Python, we used the PCA class in the scikit-learn library. We used matplotlib to create the plot.
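
It's also worth checking how much of the variance those two components actually retain; a minimal sketch using the fitted PCA object from above:

# Fraction of total variance captured by each of the two components
print(pca_2.explained_variance_ratio_)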

Split into training and testing sets

If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets so we don’t overfit.

R

# Hold out 20% of rows for testing
trainRowCount <- floor(0.8 * nrow(nba))
set.seed(1)
trainIndex <- sample(1:nrow(nba), trainRowCount)
train <- nba[trainIndex,]
test <- nba[-trainIndex,]

Python

# Sample 80% of rows for training; the rest become the test set
train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]

You’ll notice that R has many more data-analysis-focused builtins, like floor, sample, and set.seed, whereas in Python these are called via packages (math.floor, random.sample, random.seed). Recent versions of pandas come with a sample method that returns a given proportion of rows randomly sampled from a source dataframe, which makes the code much more concise. In R, there are packages that simplify sampling, but they aren't much more concise than the built-in sample function. In both cases, we set a random seed to make the results reproducible.
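
For comparison, here is a minimal sketch of the same split in Python using only the standard-library functions mentioned above (assuming nba has a unique index):

import math
import random
random.seed(1)  # mirrors R's set.seed(1)
train_row_count = math.floor(0.8 * len(nba))  # like floor(0.8 * nrow(nba))
train_indices = random.sample(range(len(nba)), train_row_count)
train = nba.iloc[train_indices]
test = nba.drop(nba.index[train_indices])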

Univariate linear regression

Let’s say we want to predict the number of assists per player from the number of field goals made per player.

R

fit <- lm(ast ~ fg, data=train)
predictions <- predict(fit, test)

Python

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[["fg"]], train["ast"])
predictions = lr.predict(test[["fg"]])

Scikit-learn has a linear regression model that we can fit and use to generate predictions. R relies on the built-in lm and predict functions. predict behaves differently depending on the kind of fitted model passed to it, so it can be used with a wide variety of models.
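
Either way, it's worth quantifying how the predictions hold up; a minimal sketch on the Python side, using scikit-learn's mean_squared_error and assuming test["ast"] holds the true values:

from sklearn.metrics import mean_squared_error
# Mean squared error of the predictions against the held-out targets
print(mean_squared_error(test["ast"], predictions))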

Calculate summary statistics for the model

R

summary(fit)

Call:
lm(formula = ast ~ fg, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-228.26  -35.38  -11.45   11.99  559.61
[output truncated]

Python

import statsmodels.formula.api as sm
# Same model as the R fit above: assists predicted from field goals
model = sm.ols(formula='ast ~ fg', data=train)
fitted = model.fit()
fitted.summary()

OLS Regression Results
==============================================================================
Dep. Variable:                    ast   R-squared:                       0.568
Model:                            OLS   Adj. R-squared:                  0.567
[output truncated]

If we want summary statistics about the fit, like the r-squared value, we need to do a bit more in Python than in R. In R, the built-in summary function reports on the model directly. In Python, we turn to the statsmodels package, which brings many R-style statistical methods to Python. We get similar results, although in general it's a bit harder to do statistical analysis in Python, and some statistical methods that exist in R have no Python equivalent.
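
If you only need a single statistic rather than the full table, the fitted statsmodels results object exposes the pieces directly; for example:

print(fitted.rsquared)      # r-squared
print(fitted.rsquared_adj)  # adjusted r-squared
print(fitted.params)        # fitted coefficients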

Fit a random forest model

Our linear regression worked well in the single-variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.

R

library(randomForest)
predictorColumns <- c("age", "mp", "fg", "trb", "stl", "blk")
rf <- randomForest(train[predictorColumns], train$ast, ntree=100)
predictions <- predict(rf, test[predictorColumns])

Python

from sklearn.ensemble import RandomForestRegressor
predictor_columns = ["age", "mp", "fg", "trb", "stl", "blk"]
# random_state keeps the forest reproducible, matching the seeding elsewhere
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, random_state=1)
rf.fit(train[predictor_columns], train["ast"])
predictions = rf.predict(test[predictor_columns])

The main difference here is that in R we needed the randomForest library to use the algorithm, whereas it's built into scikit-learn in Python. scikit-learn has a unified interface for working with many different machine learning algorithms, and there's usually only one main implementation of each algorithm in Python. In R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them. This results in a greater diversity of algorithms (many have several implementations, and some are fresh out of research labs), but at a bit of a usability cost.
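
That unified interface means swapping algorithms in Python is usually a one-line change. A minimal sketch, reusing the train/test frames from above and substituting a gradient-boosted model (a hypothetical alternative, not part of the original analysis):

from sklearn.ensemble import GradientBoostingRegressor
# Same fit/predict contract as LinearRegression and RandomForestRegressor,
# so nothing else in the pipeline changes
gbr = GradientBoostingRegressor(n_estimators=100, random_state=1)
gbr.fit(train[predictor_columns], train["ast"])
predictions = gbr.predict(test[predictor_columns])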