R vs Python: head-to-head data analysis
The epic battle between R and Python goes on. Here we compare the two across typical data science tasks: reading a CSV, summarizing data, PCA, model building, plotting, and more.
Make clusters of the players
One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.
R
library(cluster)
set.seed(1)
isGoodCol <- function(col){
sum(is.na(col)) == 0 && is.numeric(col)
}
goodCols <- sapply(nba, isGoodCol)
clusters <- kmeans(nba[,goodCols], centers=5)
labels <- clusters$cluster
Python
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=5, random_state=1)
good_columns = nba._get_numeric_data().dropna(axis=1)
kmeans_model.fit(good_columns)
labels = kmeans_model.labels_
In order to cluster properly, we remove any non-numeric columns and any columns with missing values (NA, NaN, etc.). In R, we do this by applying a function across each column and removing the column if it has any missing values or isn't numeric. We then run k-means to find 5 clusters in our data (the kmeans function is built into R; the cluster package is loaded for the plotting step below). We set a random seed with set.seed so we can reproduce our results.
In Python, we use scikit-learn, the main Python machine learning package, to fit a k-means clustering model and get our cluster labels. We prepare the data in much the same way as in R, except we use the _get_numeric_data and dropna methods to drop non-numeric columns and columns with missing values.
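As a quick sanity check (not part of the original walkthrough), we can count how many players landed in each cluster; a minimal sketch in Python, assuming the labels array from the k-means fit above:
import pandas as pd
# Count how many players fall into each of the 5 clusters
print(pd.Series(labels).value_counts())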
Plot players by cluster
We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.
R
nba2d <- prcomp(nba[,goodCols], center=TRUE)
twoColumns <- nba2d$x[,1:2]
clusplot(twoColumns, labels)
Python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
plt.show()
We made a scatter plot of our data, shading each point according to its cluster. In R, we used the clusplot function, which is part of the cluster library, and performed PCA via the prcomp function that is built into R.
With Python, we used the PCA class from the scikit-learn library and matplotlib to create the plot.
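If we want to know how much of the structure the two components actually preserve, the fitted PCA object in scikit-learn reports the explained variance; a small optional addition to the Python version above:
# Fraction of the total variance captured by each of the two components
print(pca_2.explained_variance_ratio_)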
Split into training and testing sets
If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets so we don’t overfit.
R
trainRowCount <- floor(0.8 * nrow(nba))
set.seed(1)
trainIndex <- sample(1:nrow(nba), trainRowCount)
train <- nba[trainIndex,]
test <- nba[-trainIndex,]
Python
train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]
You’ll notice that R has many more data-analysis-focused built-ins, like floor, sample, and set.seed, whereas in Python the equivalents are called via packages (math.floor, random.sample, random.seed). In Python, recent versions of pandas come with a sample method that returns a given proportion of rows randomly sampled from a source dataframe, which makes the code much more concise. In R, there are packages that make sampling simpler, but they aren't much more concise than the built-in sample function. In both cases, we set a random seed to make the results reproducible.
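For comparison, here is a rough sketch of what the same 80/20 split could look like in Python using only the standard-library functions mentioned above (math.floor, random.sample, random.seed) instead of the pandas sample method:
import math
import random

random.seed(1)
train_row_count = math.floor(0.8 * len(nba))  # same 80% cutoff as floor() in R
train_index = random.sample(range(len(nba)), train_row_count)  # like sample() in R
train = nba.iloc[train_index]
test = nba.drop(nba.index[train_index])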
Univariate linear regression
Let’s say we want to predict the number of assists per player from the number of field goals made per player.
R
fit <- lm(ast ~ fg, data=train)
predictions <- predict(fit, test)
Python
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[["fg"]], train["ast"])
predictions = lr.predict(test[["fg"]])
Scikit-learn has a linear regression model that we can fit and generate predictions from. R relies on the built-in lm and predict functions. predict behaves differently depending on the kind of fitted model passed into it – it can be used with a variety of fitted models.
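Neither snippet evaluates the predictions; as an extra step (not in the original), one option is to compute the mean squared error on the test set, which in Python might look like:
from sklearn.metrics import mean_squared_error
# Mean squared error of the univariate model on the held-out test set
print(mean_squared_error(test["ast"], predictions))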
Calculate summary statistics for the model
R
summary(fit)
Call: lm(formula = ast ~ fg, data = train) Residuals: Min 1Q Median 3Q Max -228.26 -35.38 -11.45 11.99 559.61 [output truncated]
Python
import statsmodels.formula.api as sm
model = sm.ols(formula='ast ~ fga', data=train)
fitted = model.fit()
fitted.summary()
OLS Regression Results ============================ Dep. Variable: ast R-squared: 0.568 Model: OLS Adj. R-squared: 0.567 [output truncated]
If we want summary statistics about the fit, like the r-squared value, we need to do a bit more in Python than in R. With R, we can use the built-in summary function to get information on the model. With Python, we need the statsmodels package, which makes many statistical methods available in Python. We get similar results, although it's generally a bit harder to do statistical analysis in Python, and some statistical methods that exist in R don't exist in Python.
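If we only need a single statistic rather than the whole table, the statsmodels results object exposes it directly; for example, pulling the r-squared value from the fitted object above:
# R-squared of the OLS fit, taken straight from the results object
print(fitted.rsquared)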
Fit a random forest model
Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.
R
library(randomForest)
predictorColumns <- c("age", "mp", "fg", "trb", "stl", "blk")
rf <- randomForest(train[predictorColumns], train$ast, ntree=100)
predictions <- predict(rf, test[predictorColumns])
Python
from sklearn.ensemble import RandomForestRegressor
predictor_columns = ["age", "mp", "fg", "trb", "stl", "blk"]
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(train[predictor_columns], train["ast"])
predictions = rf.predict(test[predictor_columns])
The main difference here is that we needed the randomForest library in R to use the algorithm, whereas it's built into scikit-learn in Python. scikit-learn has a unified interface for working with many different machine learning algorithms, and there's usually only one main implementation of each algorithm in Python. With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them. This results in a greater diversity of algorithms (many have several implementations, and some are fresh out of research labs), but at a bit of a usability cost.
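To illustrate that unified interface, here is a hypothetical variation (not part of the original comparison) where we swap in a different scikit-learn estimator without changing any of the surrounding code:
from sklearn.ensemble import GradientBoostingRegressor

# Any scikit-learn regressor exposes the same fit/predict interface,
# so only the constructor line changes when swapping models.
gb = GradientBoostingRegressor(n_estimators=100)
gb.fit(train[predictor_columns], train["ast"])
gb_predictions = gb.predict(test[predictor_columns])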