R vs Python: head to head data analysis

The epic battle between R vs Python goes on. Here we are comparing both of them in terms of generic tasks of data scientist’s like reading CSV, finding data summary, PCA, model building, plotting, and many more.



Calculate error

Now that we’ve fit two models, let’s calculate error. We’ll use MSE.

R

mean((test["ast"] - predictions)^2)

4573.86778567462

Python

from sklearn.metrics import mean_squared_error
mean_squared_error(test["ast"], predictions)

4166.9202475632374

In Python, the scikit-learn library has a variety of error metrics that we can use. In R, there are likely some smaller libraries that calculate MSE, but doing it manually is pretty easy in either language. There’s a small difference in errors that almost certainly due to parameter tuning, and isn’t a big deal.

Download a webpage

Now that we have data on NBA players from 2013-2014, let’s scrape some additional data to supplement it. We’ll just look at one box score from the NBA Finals here to save time.

R

library(RCurl)
url <- "http://www.basketball-reference.com/boxscores/201506140GSW.html"
data <- readLines(url)

Python

import requests
url = "http://www.basketball-reference.com/boxscores/201506140GSW.html"
data = requests.get(url).content

In Python, the requests package makes downloading web pages easy, with a consistent API for all request types. In R, RCurl provides a similarly simple way to make requests. Both download the webpage to a character datatype. Note: this step is unnecessary for the next step in R, but is shown for comparisons’s sake.

Extract player box scores

Now that we have the web page, we’ll need to parse it to extract scores for players.

R

library(rvest)
page <- read_html(url)
table <- html_nodes(page, ".stats_table")[3]
rows <- html_nodes(table, "tr")
cells <- html_nodes(rows, "td a")
teams <- html_text(cells)

extractRow <-function(rows, i){
if(i == 1){
return
}
row <- rows[i]
tag <- “td”
if(i == 2){
tag <- “th”
}
items <- html_nodes(row, tag)
html_text(items)
}

scrapeData <-function(team){
teamData <- html_nodes(page, paste(“#”,team,”_basic”, sep=””))
rows <- html_nodes(teamData, “tr”)
lapply(seq_along(rows), extractRow, rows=rows)
}

data <- lapply(teams, scrapeData)

Python

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'html.parser')
box_scores = []
for tag in soup.find_all(id=re.compile("[A-Z]{3,}_basic")):
rows = []
for i, row in enumerate(tag.find_all("tr")):
if i == 0:
continue
elif i == 1:
tag = "th"
else:
tag = "td"
row_data = [item.get_text() for item in row.find_all(tag)]
rows.append(row_data)
box_scores.append(rows)

This will create a list containing two lists, the first with the box score for CLE, and the second with the box score for GSW. Both contain the headers, along with each player and their in-game stats. We won’t turn this into more training data now, but it could easily be transformed into a format that could be added to our nba dataframe.

The R code is more complex than the Python code, because there isn’t a convenient way to use regular expressions to select items, so we have to do additional parsing to get the team names from the HTML. R also discourages using for loops in favor of applying functions along vectors. We use lapply to do this, but since we need to treat each row different depending on whether it’s a header or not, we pass the index of the item we want, and the entire rows list into the function.

We use rvest, a new and widely used R web scraping package to extract the data we need. Note that we can pass a url directly into rvest, so the last step wasn’t needed in R.

In Python, we use BeautifulSoup, the most commonly used web scraping package. It enables us to loop through the tags and construct a list of lists in a straightforward way.

Conclusion

We’ve taken a look at how to analyze a dataset with R and Python. There are many tasks we didn’t dive into, such as persisting the results of our analysis, sharing the results with others, testing and making things production-ready, and making more visualizations. We’ll dive into these at a later date, which will let us make some more definitive conclusions. For now, here’s what we can say:

R is more functional, Python is more object-oriented

As we saw from functions like lm, predict, and others, R lets functions do most of the work. Contrast this to the LinearRegression class in Python, and the sample method on dataframes.

R has more data analysis built-ins, Python relies on packages

When we looked at summary statistics, we could use the summary built-in function in R, but had to import the statsmodels package in Python. The dataframe is a built-in construct in R, but must be imported via the pandas package in Python.

Python has “main” packages for data analysis tasks, R has a larger ecosystem of small packages

With Python, we can do linear regression, random forests, and more with the scikit-learn package. It offers a consistent API, and is well-maintained. In R, we have a greater diversity of packages, but also greater fragmentation and less consistency (linear regression is a builtin, lm, randomForest is a separate package, etc).

R has more statistical support in general

R was built as a statistical language, and it shows. statsmodels in Python and other packages provide decent coverage for statistical methods, but the R ecosystem is far larger.

It’s usually more straightforward to do non-statistical tasks in Python

With well-maintained libraries like BeautifulSoup and requests, web scraping in Python is far easier than in R. This applies to other tasks that we didn’t look into closely, like saving to databases, deploying web servers, or running complex workflows.

There are many parallels between the data analysis workflow in both

There are clear points of inspiration between both R and Python (pandas Dataframes were inspired by R dataframes, the rvest package was inspired by BeautifulSoup), and both ecosystems continue to grow stronger. It’s remarkable how similar the syntax and approaches are for many common tasks in both languages.

Last word

At Dataquest, we primarily teach Python, but have recently been adding lessons on R. We see both languages as complementary, and although we think Python is stronger in more areas, R is an effective language. It can be used either as a complement for Python in areas like data exploration and statistics, or as your sole data analysis tool. As this walkthrough proves, both languages have a lot of similarities in syntax and approach, and you can’t go wrong with one, the other, or both.

Bio: Vik Paruchuri is a self-taught data scientist, and the founder of Dataquest.io, a platform for learning data science in your browser.

Original.

Related: