KDnuggets Home » News » 2015 » Oct » Software » R vs Python: head to head data analysis ( 15:n33 )

The epic battle between R and Python goes on. Here we compare the two on common data science tasks such as reading CSV files, summarizing data, PCA, model building, plotting, and more.

### Make clusters of the players

One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.

R

```
library(cluster)
set.seed(1)
isGoodCol <- function(col){
   sum(is.na(col)) == 0 && is.numeric(col)
}
goodCols <- sapply(nba, isGoodCol)
clusters <- kmeans(nba[,goodCols], centers=5)
labels <- clusters$cluster
```

Python

```
from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters=5, random_state=1)
good_columns = nba._get_numeric_data().dropna(axis=1)
kmeans_model.fit(good_columns)
labels = kmeans_model.labels_
```

In order to cluster properly, we remove any non-numeric columns and any columns with missing values (`NA`, `NaN`, etc.). In R, we do this by applying a function to each column and keeping only the columns that are numeric and have no missing values. We then use the built-in `kmeans` function to find `5` clusters in our data; the cluster package loaded at the top is used later for plotting. We set a random seed using `set.seed` to be able to reproduce our results.

In Python, we use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. We prepare the data much as we did in R, except that we use the `_get_numeric_data` and `dropna` methods to remove non-numeric columns and columns with missing values.
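The column filtering described above can also be written with the public pandas method `select_dtypes`, which is one possible alternative to the underscore-prefixed helper. The sketch below is a minimal illustration on a made-up dataframe; `toy` and its column names are assumptions for demonstration, not part of the NBA dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the nba dataframe; the columns are illustrative only.
toy = pd.DataFrame({
    "player": ["A", "B", "C"],       # non-numeric, so it is dropped
    "pts": [10.0, 20.0, 30.0],       # numeric and complete, so it is kept
    "x3p_pct": [0.3, np.nan, 0.4],   # numeric but has a missing value, so it is dropped
    "ast": [5, 7, 9],                # numeric and complete, so it is kept
})

# Keep only numeric columns, then drop every column that contains a missing value.
good = toy.select_dtypes(include="number").dropna(axis=1)
kept_columns = list(good.columns)
print(kept_columns)
```

Only `pts` and `ast` survive, which is the same kind of frame that gets passed to `KMeans.fit`.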

### Plot players by cluster

We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.

R

```
nba2d <- prcomp(nba[,goodCols], center=TRUE)
twoColumns <- nba2d$x[,1:2]
clusplot(twoColumns, labels)
```

Python

```
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
plt.show()
```

We made a scatter plot of our data and shaded each point, or changed its icon, according to its cluster. In R, we used the `clusplot` function, which is part of the cluster library. We performed PCA via the `prcomp` function that is built into R.

With Python, we used the PCA class in the scikit-learn library. We used matplotlib to create the plot.
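To make the dimensionality reduction step less of a black box, here is a minimal NumPy sketch of what projecting onto two principal components involves: center the data, take a singular value decomposition, and project onto the first two directions. The helper `pca_project` and the synthetic data are assumptions for illustration. It is not how scikit-learn or `prcomp` is implemented internally, and component signs may differ from either.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project the rows of `data` onto its top principal components."""
    # Center each column so the components describe variance around the mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic data standing in for the numeric player statistics.
rng = np.random.default_rng(1)
points = rng.normal(size=(50, 5))
projected = pca_project(points)
print(projected.shape)
```

The two resulting columns are uncorrelated, and the first carries at least as much variance as the second, which is why they are a reasonable pair of axes for a cluster plot.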

### Split into training and testing sets

If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets, so that we evaluate the model on data it has not seen and can detect overfitting.

R

```
trainRowCount <- floor(0.8 * nrow(nba))
set.seed(1)
trainIndex <- sample(1:nrow(nba), trainRowCount)
train <- nba[trainIndex,]
test <- nba[-trainIndex,]
```

Python

```
train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]
```

You’ll notice that R has many more data-analysis focused builtins, like `floor`, `sample`, and `set.seed`, whereas in Python the equivalents live in modules that have to be imported (`math.floor`, `random.sample`, `random.seed`). In Python, recent versions of pandas include a `sample` method that returns a given proportion of rows randomly sampled from a source dataframe, which makes the code much more concise. In R, there are packages that make sampling simpler, but they aren’t much more concise than the built-in `sample` function. In both cases, we set a random seed to make the results reproducible.
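A quick way to convince yourself that the pandas split behaves as intended is to run it on a tiny dataframe and check that the two pieces are disjoint and together cover every row. The sketch below uses a made-up ten-row frame; `frame` is an assumption for illustration, and `drop(train.index)` is used as an equivalent of the `isin` idiom when the index is unique.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the nba dataframe.
frame = pd.DataFrame({"fg": range(10), "ast": range(10, 20)})

# Sample 80% of the rows for training; the fixed seed makes the split repeatable.
train = frame.sample(frac=0.8, random_state=1)
# Everything not sampled becomes the test set (assumes a unique index).
test = frame.drop(train.index)

print(len(train), len(test))
```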

### Univariate linear regression

Let’s say we want to predict number of assists per player from field goals made per player.

R

```
fit <- lm(ast ~ fg, data=train)
predictions <- predict(fit, test)
```

Python

```
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(train[["fg"]], train["ast"])
predictions = lr.predict(test[["fg"]])
```

Scikit-learn has a linear regression model that we can fit and generate predictions from. R relies on the built-in `lm` and `predict` functions. `predict` will behave differently depending on the kind of fitted model that is passed into it – it can be used with a variety of fitted models.
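For a single predictor, the least-squares line has a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept follows from the means. The sketch below illustrates that calculation with NumPy; `fit_line` and the toy numbers are assumptions for illustration, not the NBA data.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    # Slope = sum of cross-deviations / sum of squared deviations of x.
    slope = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# Toy data that lies exactly on the line y = 2x + 1.
fg = [1, 2, 3, 4]
ast = [3, 5, 7, 9]
slope, intercept = fit_line(fg, ast)
print(slope, intercept)
```

On real data the points will not fall exactly on a line, and the same formulas give the line that minimizes the squared prediction error.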

### Calculate summary statistics for the model

R

`summary(fit)`

```Call:
lm(formula = ast ~ fg, data = train)

Residuals:
Min      1Q  Median      3Q     Max
-228.26  -35.38  -11.45   11.99  559.61
[output truncated]```

Python

```
import statsmodels.formula.api as smf

# Regress assists on fga (field goal attempts); the R model above used fg.
model = smf.ols(formula='ast ~ fga', data=train)
fitted = model.fit()
fitted.summary()
```

```OLS Regression Results
============================
Dep. Variable:                    ast
R-squared:                       0.568
Model:                            OLS
[output truncated]```

If we want summary statistics about the fit, like the R-squared value, we’ll need to do a bit more in Python than in R. With R, we can use the built-in `summary` function to get information on the model. With Python, we need to use the statsmodels package, which makes many statistical methods available in Python. We get similar results, although it’s generally a bit harder to do statistical analysis in Python, and some statistical methods that exist in R don’t exist in Python.
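The R-squared value reported in both summaries can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below shows that calculation under the usual definition for a model with an intercept; `r_squared` and the sample numbers are assumptions for illustration.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_residual = np.sum((actual - predicted) ** 2)
    ss_total = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_residual / ss_total

observed = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(observed, observed)                  # predictions match exactly
baseline = r_squared(observed, [6.0, 6.0, 6.0, 6.0])     # always predict the mean
print(perfect, baseline)
```

A perfect fit scores 1, and a model that always predicts the mean scores 0, which gives a scale for reading the value in the summary output.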

### Fit a random forest model

Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.

R

```
library(randomForest)
predictorColumns <- c("age", "mp", "fg", "trb", "stl", "blk")
rf <- randomForest(train[predictorColumns], train$ast, ntree=100)
predictions <- predict(rf, test[predictorColumns])
```

Python

```
from sklearn.ensemble import RandomForestRegressor

predictor_columns = ["age", "mp", "fg", "trb", "stl", "blk"]
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(train[predictor_columns], train["ast"])
predictions = rf.predict(test[predictor_columns])
```

The main difference here is that we needed the separate randomForest library to use the algorithm in R, whereas in Python it ships as part of scikit-learn. scikit-learn has a unified interface for working with many different machine learning algorithms in Python, and there’s usually only one main implementation of each algorithm in Python. With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them. This results in a greater diversity of algorithms (many have several implementations, and many are fresh out of research labs), but with a bit of a usability hit.
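One practical consequence of scikit-learn's unified interface is that different estimators can be swapped in and out of the same loop, since they share `fit` and `predict`. The sketch below illustrates this on synthetic data; the arrays, the 150/50 split, and the `models` dictionary are assumptions for illustration and do not reproduce the NBA results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: the target depends mostly on the first feature.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=20, random_state=1),
}

# The same two calls work for both estimators.
predictions = {}
for name, model in models.items():
    model.fit(X[:150], y[:150])
    predictions[name] = model.predict(X[150:])

print({name: preds.shape for name, preds in predictions.items()})
```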