Getting Started with Data Science – R
A great introductory post from DataRobot on getting started with data science in R, including cleaning data and performing predictive modeling.
Now we model!
We have predictors and a target, so it is time to build a model. To get some variety in modeling methods, we will use ordinary least squares (OLS), Ridge Regression and Lasso Regression (both forms of regularized Linear Regression), a Gradient Boosting Machine (GBM), and a CART decision tree. These are just a few representatives of the many packages available in R, which give you access to a wide range of machine learning techniques.
Don’t be alarmed if this code block takes quite a while to run – the data is of non-negligible size. Additionally, the ridge and lasso fits are each run several times under cross-validation to choose an optimal penalty parameter, and the gradient boosting machine builds many trees to produce its ensembled predictions. There is a lot of computation going on under the hood, so get up and take a break if you need to.
# Split the data into test and train sets
train_rows <- sample(nrow(data), round(nrow(data) * 0.5))
traindf <- data[train_rows, ]
testdf <- data[-train_rows, ]
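Because `sample` draws at random, the split above changes on every run. A minimal sketch of making it repeatable, assuming a small made-up data frame (`split_toy`, a hypothetical stand-in for `data`) and an arbitrary seed:

```r
# Toy stand-in for the crash data (hypothetical; not the GES data)
split_toy <- data.frame(id = 1:10, y = rep(c(0, 1), 5))

set.seed(42)  # any fixed seed makes the random split repeatable
toy_train_rows <- sample(nrow(split_toy), round(nrow(split_toy) * 0.5))
toy_train <- split_toy[toy_train_rows, ]
toy_test  <- split_toy[-toy_train_rows, ]

# The two halves should be disjoint and together cover every row
length(intersect(toy_train$id, toy_test$id))
nrow(toy_train) + nrow(toy_test) == nrow(split_toy)
```

Negative indexing (`-toy_train_rows`) is what guarantees that no row lands in both sets.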
First the linear model:
OLS_model <- lm(INJSEV_IM ~ ., data = traindf)
Then the GBM:
print("Started Training GBM")
## [1] "Started Training GBM"
# GBM is easier to process as a data matrix
response_column <- which(colnames(traindf) == "INJSEV_IM")
trainy <- traindf$INJSEV_IM
gbm_formula <- as.formula(paste0("INJSEV_IM ~ ", paste(colnames(traindf[, -response_column]),
collapse = " + ")))
gbm_model <- gbm(gbm_formula, traindf, distribution = "bernoulli", n.trees = 500,
bag.fraction = 0.75, cv.folds = 5, interaction.depth = 3)
## Warning: variable 16: STR_VEH has no variation.
## Warning: variable 47: DRUGTST2 has no variation.
## Warning: variable 48: DRUGTST3 has no variation.
## Warning: variable 50: DRUGRES2 has no variation.
## Warning: variable 51: DRUGRES3 has no variation.
## Warning: variable 56: LOCATION has no variation.
print("Finished Training GBM")
## [1] "Finished Training GBM"
# For glmnet we make a copy of our dataframe into a matrix
trainx_dm <- data.matrix(traindf[, -response_column])
print("Started fitting LASSO")
## [1] "Started fitting LASSO"
lasso_model <- cv.glmnet(x = trainx_dm, y = traindf$INJSEV_IM, alpha = 1)
print("Finished fitting LASSO")
## [1] "Finished fitting LASSO"
print("Started fitting RIDGE")
## [1] "Started fitting RIDGE"
ridge_model <- cv.glmnet(x = trainx_dm, y = traindf$INJSEV_IM, alpha = 0)
print("Finished fitting RIDGE")
## [1] "Finished fitting RIDGE"
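The `alpha` argument is what separates the two `cv.glmnet` calls above. As a reminder (this formula comes from the glmnet documentation, not from the original post), for a Gaussian response glmnet minimizes:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

With `alpha = 1` only the L1 (lasso) term remains, and with `alpha = 0` only the L2 (ridge) term. `cv.glmnet` fits a whole path of lambda values and cross-validates over them, which is where the `lambda.min` used at prediction time comes from.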
And finally, we make a decision tree:
dtree_model <- rpart(INJSEV_IM ~ ., traindf)
Now we can make predictions. For the GBM, we need to decide how many trees to predict with. The following plots how well the GBM performs at each of the 500 iterations, one for each additional tree. We want the iteration that minimizes the green line, which represents the cross-validated error, an estimate of model performance on data not used for training.
gbm_perf <- gbm.perf(gbm_model, method = "cv")
Notice that since the curve is still decreasing, we could try training more trees at the expense of more computation.
Now we can make predictions using our trained models:
predictions_ols <- predict(OLS_model, testdf[, -response_column])
## Warning: prediction from a rank-deficient fit may be misleading
predictions_gbm <- predict(gbm_model, newdata = testdf[, -response_column],
n.trees = gbm_perf, type = "response")
testx_dm <- data.matrix(testdf[, -response_column])
predictions_lasso <- predict(lasso_model, newx = testx_dm, type = "response",
s = "lambda.min")[, 1]
predictions_ridge <- predict(ridge_model, newx = testx_dm, type = "response",
s = "lambda.min")[, 1]
predictions_dtree <- predict(dtree_model, testdf[, -response_column])
We can now assess model performance on the test set. We will be using the metric of area under the ROC curve. A perfect classifier would score 1.0 while purely random guessing would score 0.5.
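To make the metric concrete, here is a minimal base-R sketch of AUC computed from ranks. `auc_by_rank` is a hypothetical helper written for illustration, not the `auc` function called in this post, and the scores are made up:

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# gets a higher score than a randomly chosen negative case (ties count half).
auc_by_rank <- function(actual, predicted) {
  r <- rank(predicted)             # average ranks handle tied scores
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

actual <- c(0, 0, 1, 1)
auc_by_rank(actual, c(0.1, 0.2, 0.8, 0.9))  # perfect ranking: 1
auc_by_rank(actual, c(0.5, 0.5, 0.5, 0.5))  # constant score: 0.5
auc_by_rank(actual, c(0.1, 0.8, 0.2, 0.9))  # one of four pairs misordered: 0.75
```

Only the ordering of the scores matters, which is why the unbounded OLS predictions can be scored with the same metric as the other models.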
print("OLS: Area under the ROC curve:")
## [1] "OLS: Area under the ROC curve:"
auc(testdf$INJSEV_IM, predictions_ols)
## [1] 0.9338
print("Ridge: Area under the ROC curve:")
## [1] "Ridge: Area under the ROC curve:"
auc(testdf$INJSEV_IM, predictions_ridge)
## [1] 0.9348
print("LASSO: Area under the ROC curve:")
## [1] "LASSO: Area under the ROC curve:"
auc(testdf$INJSEV_IM, predictions_lasso)
## [1] 0.9339
print("GBM: Area under the ROC curve:")
## [1] "GBM: Area under the ROC curve:"
auc(testdf$INJSEV_IM, predictions_gbm)
## [1] 0.921
print("Decision Tree: Area under the ROC curve:")
## [1] "Decision Tree: Area under the ROC curve:"
auc(testdf$INJSEV_IM, predictions_dtree)
## [1] 0.8473
What else can I do?
We have a blog post that goes into more detail about regularized linear regression, if that is what you are interested in. It would also be good to look at the various other R packages listed in the CRAN Task View “MachineLearning”. Beyond that, here are a few challenges that you can undertake to help you hone your data science skills.
Data Prep
If it wasn’t obvious in the blog post, the column STRATUM is a data leak (it encodes the severity of the crash). Which other columns contain data leaks? Can you come up with a rigorous method to generate candidates for deletion without having to read the entire GES manual?
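One possible starting point, sketched here under loud assumptions: score every column on its own against the target and flag the ones that are suspiciously good. The data below is synthetic, and the column names, helper name, and 0.49 threshold are all hypothetical choices rather than anything taken from the GES data.

```r
# Synthetic example (hypothetical columns; not the GES data)
set.seed(1)
n <- 200
target <- rbinom(n, 1, 0.3)
leak_demo <- data.frame(
  target = target,
  noise  = rnorm(n),                          # unrelated to the target
  weak   = target + rnorm(n, sd = 2),         # mildly predictive
  leak   = target * 10 + rnorm(n, sd = 0.1)   # almost a copy of the target
)

# AUC of a single column used directly as the score (rank-based formula)
single_column_auc <- function(actual, score) {
  r <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

predictors <- setdiff(names(leak_demo), "target")
aucs <- sapply(predictors,
               function(col) single_column_auc(leak_demo$target, leak_demo[[col]]))

# Distance from 0.5 in either direction: how informative a column is by itself
strength <- sort(abs(aucs - 0.5), decreasing = TRUE)
print(strength)

# Arbitrary cutoff (an assumption): near-perfect columns deserve a manual look
leak_candidates <- names(strength)[strength > 0.49]
leak_candidates
```

This only catches columns whose numeric value moves monotonically with the target. A coded categorical column may need to be treated as a factor first, and a flagged column is a candidate to investigate, not proof of a leak.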
And while we are considering data preparation, consider the column REGION. Any regression model will consider the West region to be 4 times more REGION-y than the Northeast – that just doesn’t make sense. Which columns could benefit from being encoded as factor levels, rather than as numeric? To change a column into a factor, use the as.factor command.
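A small sketch of the difference, using a made-up data frame (`region_toy`, with hypothetical values) rather than the real data:

```r
# Toy data (hypothetical codes and values, for illustration only)
region_toy <- data.frame(
  REGION = c(1, 2, 3, 4, 1, 2, 3, 4),
  y      = c(5, 1, 9, 2, 6, 0, 8, 3)
)

# As a number, REGION gets a single slope: its effect must be linear in the code
fit_numeric <- lm(y ~ REGION, data = region_toy)

# As a factor, each region becomes its own level with its own coefficient
region_toy$REGION_F <- as.factor(region_toy$REGION)
fit_factor <- lm(y ~ REGION_F, data = region_toy)

length(coef(fit_numeric))  # 2: intercept + one slope
length(coef(fit_factor))   # 4: intercept + three offsets from the baseline level
colnames(model.matrix(~ REGION_F, data = region_toy))
```

The factor version lets the model learn that region 3 is high and region 4 is low, a pattern that a single slope cannot express.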
Which is the best model?
How good of a model can you build for predicting fatalities from car crashes? First you will need to settle on a metric of “good” – and be prepared to reason why it is a good metric. How bad is it to be wrong? How good is it to be right?
In order to avoid overfitting you will want to separate some of the data and hold it in reserve for when you evaluate your models – some of these models are expressive enough to memorize all the data!
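The effect is easy to see on simulated data. The sketch below assumes a made-up linear relationship and compares a simple model with a deliberately over-flexible one; all names and settings are illustrative.

```r
set.seed(7)
n <- 60
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.5)   # the true relationship is linear
sim <- data.frame(x = x, y = y)

# Hold half of the rows in reserve
train_idx <- sample(n, n / 2)
train   <- sim[train_idx, ]
holdout <- sim[-train_idx, ]

# Mean squared error of a fitted model on a given data frame
mse <- function(model, d) mean((d$y - predict(model, newdata = d))^2)

simple   <- lm(y ~ x, data = train)
flexible <- lm(y ~ poly(x, 10), data = train)  # far more expressive than needed

c(simple_train = mse(simple, train), simple_holdout = mse(simple, holdout))
c(flexible_train = mse(flexible, train), flexible_holdout = mse(flexible, holdout))
```

The flexible model can never do worse than the simple one on the training rows, because it contains the straight line as a special case. Its holdout error is the number to trust, and that is typically where the extra flexibility stops paying off.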
Which is the best story?
Of course, data science is more than just gathering data and building models – it’s about telling a story backed up by the data. Do crashes with alcohol involved tend to lead to more serious injuries? When it is late at night, are there more convertibles involved in crashes than other types of vehicles (this one involves looking at a different dataset within the GES data)? Which is the safest seat in a car? And how sure can you be that your findings are statistically significant?
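For the last question, one common tool is a chi-squared test of independence on a table of counts. The counts below are made up purely to show the mechanics and say nothing about real crashes:

```r
# Made-up counts (NOT from the GES data): rows = alcohol involved, columns = injury
crash_counts <- matrix(
  c(40, 60,    # alcohol involved:     serious, not serious
    20, 80),   # no alcohol involved:  serious, not serious
  nrow = 2, byrow = TRUE,
  dimnames = list(alcohol = c("yes", "no"),
                  injury  = c("serious", "not_serious"))
)

# Null hypothesis: injury severity is independent of alcohol involvement
result <- chisq.test(crash_counts)
result$p.value
```

A small p-value says the pattern would be unlikely if the two variables were unrelated. It does not say that one causes the other, so the story still needs care.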
Good luck coming up with a great story!
This post was written by Mark Steadman and Dallin Akagi. Please post any feedback, comments, or questions below or send us an email at
Original. Reposted with permission.