
Getting Started with Data Science – R

A great introductory post from DataRobot on getting started with data science in R, including cleaning data and performing predictive modeling.

By Dallin Akagi and Mark Steadman, DataRobot.

This short tutorial will guide you through some basic data analysis methods and show you how to implement some of the more sophisticated techniques available today. We will look into traffic accident data from the National Highway Traffic Safety Administration and try to predict fatal accidents using state-of-the-art statistical learning techniques. If you are interested, download the code at the bottom and follow along as we work through a real-world data set. This post is in R, while a companion post covers the same techniques in Python.


Getting started in R

The swirl package is designed to teach people R. You can visit their website to see links to YouTube videos on installing R on both Mac and Windows.

You can download R directly using these links for the Windows installer or the Mac package. The R install packages are hosted by the CRAN project. Just bear in mind that these are large files (60-70MB) and will take some time to download and install.

We suggest you install RStudio and view this file there. After you have installed R, download and run the installer for RStudio. We recommend RStudio as the IDE, but you can also use the R console directly if you prefer.

If you open this file in RStudio, you can see the code is stored in “cells” or “chunks” like this:

# This is a code cell

You can also enter the code in the cells directly at the R command prompt.

If you click in a cell, you can run the code in that cell by selecting “Run Current Chunk” under the Chunks menu.

Try running the code in the following cell, which will load the libraries needed for the rest of this tutorial.

library("glmnet")  # For ridge regression fitting. It also supports elastic-net and LASSO models 

## Loading required package: Matrix
## Loaded glmnet 1.9-5

library("gbm")  # For Gradient-Boosting

## Loading required package: survival
## Loading required package: splines
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1

library("rpart")  # For building decision trees

If you get an error, delete the # and run the following code chunk to install the needed packages.

# install.packages(c('glmnet', 'gbm', 'rpart'))
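
If only some of the packages are missing, you may prefer to install just those. The following is a minimal sketch, not part of the original tutorial: `missing_packages` is a hypothetical helper name, built on base R's `requireNamespace()`.

```r
# Hypothetical helper (not from the original tutorial): return the subset of
# `pkgs` whose namespaces cannot be loaded from the current library paths.
missing_packages <- function(pkgs) {
    is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!is_installed]
}

needed <- c("glmnet", "gbm", "rpart")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids reinstalling packages that are already present, which can matter on a slow connection.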

If you prefer, you can download a version with just the R code here, which you can load into R via the source command.

Now that you have R installed, we can start with our analysis. Inside RStudio, select “Run Next Chunk” under the “Chunks” menu to run the examples one at a time.

Get some data

Playing with data requires having the data available, so let’s take care of that right now. The National Highway Traffic Safety Administration (NHTSA) publishes some really cool data. The following code snippet downloads the data to a temporary file, extracts the file we are interested in, “PERSON.TXT”, from the zip file, and loads it into R. The zip is 14.9 MB, so it might take some time to run, but it is worth the wait.

temp <- tempfile()
download.file("ftp://ftp.nhtsa.dot.gov/GES/GES12/GES12_Flatfile.zip", temp, 
    quiet = TRUE)
accident_data_set <- read.delim(unz(temp, "PERSON.TXT"))
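
After loading, it is worth confirming that the file was parsed as expected. Here is a small sketch that uses a made-up, tab-delimited stand-in for PERSON.TXT, so it runs without the download. The file contents and the name `toy_person` are assumptions for illustration only.

```r
# Toy stand-in for PERSON.TXT: a small tab-delimited file (values are made up).
toy_path <- tempfile(fileext = ".txt")
writeLines(c("CASENUM\tAGE\tINJSEV_IM",
             "1001\t34\t0",
             "1002\t61\t4",
             "1003\t19\t2"), toy_path)

# read.delim() assumes a header row and tab separators by default.
toy_person <- read.delim(toy_path)

print(dim(toy_person))   # number of rows and columns
str(toy_person)          # column types as R inferred them
unlink(toy_path)         # remove the temporary file when done
```

The same `dim()` and `str()` checks can be applied to `accident_data_set` once the real download has finished.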

With our data downloaded and readily accessible, we can start to play around and see what we can learn from it. Many of the columns are encoded with codes that are documented in the manual, so it might be useful to download the PDF and keep it at hand. Again, we will be looking at PERSON.TXT, which contains information about individuals involved in road accidents. Its columns are:

##  [1] "AGE"        "AGE_IM"     "AIR_BAG"    "ALC_RES"    "ALC_STATUS"
##  [6] "ATST_TYP"   "BODY_TYP"   "CASENUM"    "DRINKING"   "DRUGRES1"  
## [11] "DRUGRES2"   "DRUGRES3"   "DRUGS"      "DRUGTST1"   "DRUGTST2"  
## [16] "DRUGTST3"   "DSTATUS"    "EJECT_IM"   "EJECTION"   "EMER_USE"  
## [21] "FIRE_EXP"   "HARM_EV"    "HOSPITAL"   "HOUR"       "IMPACT1"   
## [26] "INJ_SEV"    "INJSEV_IM"  "LOCATION"   "MAKE"       "MAN_COLL"  
## [31] "MINUTE"     "MOD_YEAR"   "MONTH"      "PERALCH_IM" "PER_NO"    
## [36] "PER_TYP"    "PJ"         "P_SF1"      "P_SF2"      "P_SF3"     
## [41] "PSU"        "PSUSTRAT"   "REGION"     "REST_MIS"   "REST_USE"  
## [46] "ROLLOVER"   "SCH_BUS"    "SEAT_IM"    "SEAT_POS"   "SEX"       
## [51] "SEX_IM"     "SPEC_USE"   "STRATUM"    "STR_VEH"    "TOW_VEH"   
## [56] "VE_FORMS"   "VEH_NO"     "WEIGHT"
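
The post does not show the call that produced this listing. The names appear to be sorted, so something like `sort(colnames(...))` would be one way to get it. A minimal sketch on a toy data frame, where the columns are chosen arbitrarily:

```r
# Toy data frame standing in for accident_data_set (columns chosen arbitrarily).
toy <- data.frame(SEX = c(1, 2), AGE = c(34, 61), HOUR = c(7, 22))

# List the column names in alphabetical order.
print(sort(colnames(toy)))
```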

Clean up the data

One prediction task you might find interesting is predicting whether or not a crash was fatal. The column INJSEV_IM contains imputed values for the severity of the injury, but there is one value that might complicate analysis – level 6 indicates that the person died prior to the crash.

##      0      1      2      3      4      5      6 
## 100840  19380  20758   9738   1178   1179      4
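
A frequency table like the one above can be obtained with base R's `table()`. The sketch below uses made-up severity codes rather than the real column, so the counts are illustrative only.

```r
# Toy severity codes (made up) standing in for accident_data_set$INJSEV_IM.
toy_severity <- c(0, 0, 1, 4, 0, 2, 6, 0, 1, 4)

# Count how many records fall into each severity level.
severity_counts <- table(toy_severity)
print(severity_counts)

# Share of records with the problematic level 6.
print(mean(toy_severity == 6))
```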

Fortunately, there are only four of those cases in the dataset, so it is not unreasonable to ignore them during our analysis. The code below drops them and then checks each column for missing values; it turns out that a few columns have some:

accident_data_set <- accident_data_set[accident_data_set$INJSEV_IM != 6, ]

# Report every column that contains missing values, with its NA count
for (i in seq_len(ncol(accident_data_set))) {
    num_missing <- sum(is.na(accident_data_set[, i]))
    if (num_missing > 0) {
        print(paste0(colnames(accident_data_set)[i], ":  ", num_missing))
    }
}

## [1] "MAKE:  5162"
## [1] "BODY_TYP:  5162"
## [1] "MOD_YEAR:  5162"
## [1] "TOW_VEH:  5162"
## [1] "SPEC_USE:  5162"
## [1] "EMER_USE:  5162"
## [1] "ROLLOVER:  5162"
## [1] "IMPACT1:  5162"
## [1] "FIRE_EXP:  5162"
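
The same per-column count can be written without an explicit loop. This is a sketch of an alternative, not the authors' code, shown on a made-up data frame:

```r
# Toy data frame (made up) with missing values in two columns.
toy <- data.frame(
    AGE  = c(34, 61, 19, 45),
    MAKE = c(12, NA, 20, NA),
    HOUR = c(7, 22, NA, 13)
)

# is.na() yields a logical matrix; colSums() counts the TRUEs per column.
missing_per_column <- colSums(is.na(toy))

# Keep only the columns that actually have missing values.
print(missing_per_column[missing_per_column > 0])
```

The result is a named numeric vector, which is convenient if you want to sort or filter the counts afterwards.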

For this analysis, we will just drop these rows (the missing values all occur in the same rows), but you certainly don’t have to do that. It may be that a systematic data entry error is causing these values to be read incorrectly. Regardless of how you clean up this data, we will definitely want to drop the column INJ_SEV: it is the non-imputed version of INJSEV_IM and therefore a severe data leak. There are other leaky columns as well.

# Indices of rows that contain at least one missing value
rows_to_drop <- which(apply(accident_data_set, 1, FUN = function(X) {
    return(sum(is.na(X)) > 0)
}))
data <- accident_data_set[-rows_to_drop, ]
data$INJ_SEV <- NULL
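
An alternative way to drop incomplete rows is base R's `complete.cases()`. The sketch below is a swapped-in technique, not the code used in this post, and it runs on a made-up data frame.

```r
# Toy data frame (made up); INJ_SEV mirrors INJSEV_IM to mimic the leak.
toy <- data.frame(
    AGE       = c(34, 61, 19, 45),
    MAKE      = c(12, NA, 20, NA),
    INJ_SEV   = c(0, 4, 2, 1),
    INJSEV_IM = c(0, 4, 2, 1)
)

# complete.cases() is TRUE for rows with no missing values.
toy_clean <- toy[complete.cases(toy), ]

# Drop the leaky, non-imputed severity column.
toy_clean$INJ_SEV <- NULL
print(toy_clean)
```

One design consideration: negative indexing with an empty index vector selects no rows at all, so the logical-mask form is the safer choice for a data set that happens to have no missing values.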

One more preprocessing step we’ll do is to transform the response. The manual shows that category 4 denotes a fatal injury, so we encode the target as 1 for a fatal injury and 0 otherwise.

data$INJSEV_IM <- as.numeric(data$INJSEV_IM == 4)
target <- data$INJSEV_IM
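
Category 4 accounts for only 1,178 records in the severity table shown earlier, so the positive class is rare, and it can be useful to check the class balance after encoding. A minimal sketch on made-up severity codes (`toy_severity` and `toy_target` are illustrative names):

```r
# Toy severity codes (made up) standing in for data$INJSEV_IM before encoding.
toy_severity <- c(0, 1, 4, 2, 0, 4, 3, 5)

# 1 if the injury was fatal (category 4), 0 otherwise.
toy_target <- as.numeric(toy_severity == 4)

print(table(toy_target))   # class counts
print(mean(toy_target))    # share of fatal cases
```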