# Getting Started with Data Science – Python

A great introductory post from DataRobot on getting started with data science in the Python ecosystem, including cleaning data and performing predictive modeling.

### Now we model!

We have predictors and we have a target, so now it is time to build a model. To get some variety in modeling methods, we will use ordinary least squares (OLS), Ridge regression and Lasso regression (both forms of regularized linear regression), a gradient boosting machine (GBM), and a CART decision tree. These are just a few representatives from the scikit-learn library, which gives access to many more machine learning techniques.

Don’t be alarmed if these cell blocks take quite a bit of time to run, since the data is of non-negligible size. Additionally, some of the models run a cross-validated grid search over several parameter values to find the best fit, and the gradient boosting classifier builds many trees in order to produce its ensembled decisions. There is a lot of computation going on under the hood, so get up and take a break if you need one.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Train on half of the data while reserving the other half for
# model comparisons
xtrain, xtest, ytrain, ytest = train_test_split(
    data.values, target.values, train_size=0.5)

linreg = LinearRegression()
linreg.fit(xtrain, ytrain)
lr_preds = linreg.predict(xtest)
lr_perf = roc_auc_score(ytest, lr_preds)
print('OLS: Area under the ROC curve = {}'.format(lr_perf))
```

OLS: Area under the ROC curve = 0.935139729907

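```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Cross-validated search over 10 log-spaced regularization strengths
ridge = GridSearchCV(Ridge(), {'alpha': np.logspace(-10, 10, 10)})
ridge.fit(xtrain, ytrain)
ridge_preds = ridge.predict(xtest)
ridge_performance = roc_auc_score(ytest, ridge_preds)
print('Ridge: Area under the ROC curve = {}'.format(ridge_performance))
```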

Ridge: Area under the ROC curve = 0.935221465912

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

# Cross-validated search over 5 log-spaced regularization strengths
lasso = GridSearchCV(Lasso(), {'alpha': np.logspace(-10, -8, 5)})
lasso.fit(xtrain, ytrain)
lasso_preds = lasso.predict(xtest)
lasso_performance = roc_auc_score(ytest, lasso_preds)
print('Lasso: Area under the ROC curve = {}'.format(lasso_performance))
```

Lasso: Area under the ROC curve = 0.939851198289

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

gbm = GradientBoostingClassifier(n_estimators=500)
gbm.fit(xtrain, ytrain)
# Use the predicted probability of the positive class as the score
gbm_preds = gbm.predict_proba(xtest)[:, 1]
gbm_performance = roc_auc_score(ytest, gbm_preds)
print('GBM: Area under the ROC curve = {}'.format(gbm_performance))
```

GBM: Area under the ROC curve = 0.970431372668

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Cross-validated search over tree depths 3 through 9
tree = GridSearchCV(DecisionTreeClassifier(), {'max_depth': np.arange(3, 10)})
tree.fit(xtrain, ytrain)
tree_preds = tree.predict_proba(xtest)[:, 1]
tree_performance = roc_auc_score(ytest, tree_preds)
print('DecisionTree: Area under the ROC curve = {}'.format(tree_performance))
```

DecisionTree: Area under the ROC curve = 0.913468225877

As one final morsel for you to chew on, it would be good to understand which variables the GBM model thinks are most useful for classification. Spoiler alert: data leaks ahead.

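```python
import pandas as pd

# Rank the predictors by the GBM's feature importances and show the top 10
importances = pd.Series(gbm.feature_importances_, index=data.columns)
print(importances.sort_values(ascending=False)[:10])
```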

```
STRATUM       0.261623
REST_USE      0.123142
EJECT_IM      0.071433
AGE_IM        0.056080
ALC_RES       0.050908
AIR_BAG       0.048473
HARM_EV       0.044688
HOUR          0.043379
DRUGS         0.042505
PERALCH_IM    0.035529
dtype: float64
```

### Now what should I do?

We have a great blogpost that goes into more detail about regularized linear regression, if that is what you are interested in. It would also be good to look into all the models that are offered by scikit-learn – you might find some you have never heard of! Beyond that, here are a few challenges that you can undertake to help you hone your data science skills.

**Data Prep**

If it wasn’t obvious in the blog post, the column `STRATUM` is a data leak (it encodes the severity of the crash). Which other columns contain data leaks? Can you come up with a rigorous method to generate candidates for deletion without having to read the entire GES manual?
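One possible starting point, and certainly not the rigorous answer the challenge asks for, is to score every column on its own and flag the ones that look suspiciously predictive. The sketch below assumes numeric columns and uses synthetic data in place of the crash data; the helper `single_feature_auc` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def single_feature_auc(frame, labels):
    """Return the AUC obtained by using each column, on its own, as a score.

    Values are folded around 0.5 so that strongly negative relationships
    also stand out. Hypothetical helper, not part of the original post.
    """
    scores = {}
    for column in frame.columns:
        auc = roc_auc_score(labels, frame[column])
        scores[column] = max(auc, 1.0 - auc)
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic stand-in for the crash data: one leaky column, two ordinary ones.
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, size=1000)
frame = pd.DataFrame({
    'leaky': labels + rng.normal(scale=0.1, size=1000),  # nearly the target
    'weak': labels + rng.normal(scale=3.0, size=1000),   # mildly informative
    'noise': rng.normal(size=1000),                      # unrelated
})

ranking = single_feature_auc(frame, labels)
print(ranking)
```

A column whose single-feature AUC sits close to 1.0 deserves a look in the manual before it goes into a model. This screen only catches monotonic relationships, so integer-coded categorical columns would need a different treatment.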

And while we are considering data preparation, consider the column `REGION`. Any regression model will consider the West region to be 4 times more `REGION`-y than the Northeast – that just doesn’t make sense. Which columns could benefit from a one-hot encoding?
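As a minimal sketch of what a one-hot encoding looks like, pandas can expand an integer-coded column into one indicator column per category with `get_dummies`. The tiny frame below is made up for illustration, and the region codes 1 through 4 are an assumption based on the description above rather than values checked against the GES manual.

```python
import pandas as pd

# Made-up rows; the codes 1-4 for REGION are assumed, not taken from GES.
frame = pd.DataFrame({'REGION': [1, 2, 3, 4, 4, 1],
                      'AGE_IM': [34, 51, 22, 67, 45, 29]})

# Replace the integer-coded column with one indicator column per category.
encoded = pd.get_dummies(frame, columns=['REGION'], prefix='REGION')
print(encoded.columns.tolist())
```

After the encoding, each region gets its own coefficient in a linear model, so no ordering or spacing between regions is implied.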

**Which is the best model?**

How good of a model can you build for predicting fatalities from car crashes? First you will need to settle on a metric of “good” – and be prepared to reason why it is a good metric. How bad is it to be wrong? How good is it to be right?
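One way to make "how bad is it to be wrong?" concrete is to attach a cost to each kind of mistake and compare models, or thresholds, by their average cost. The sketch below is only an illustration: the cost values, the helper `average_cost`, and the toy labels and scores are all assumptions, not anything from the original analysis.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical costs: missing a fatal crash is assumed to be ten times
# worse than raising a false alarm. Correct predictions cost nothing.
COST_FALSE_NEGATIVE = 10.0
COST_FALSE_POSITIVE = 1.0


def average_cost(y_true, scores, threshold):
    """Average misclassification cost per example at a given threshold."""
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    total = COST_FALSE_NEGATIVE * fn + COST_FALSE_POSITIVE * fp
    return total / len(y_true)


# Toy labels and scores, made up for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9])

for threshold in (0.25, 0.5, 0.75):
    print(threshold, average_cost(y_true, scores, threshold))
```

With these assumed costs the lowest threshold comes out cheapest, because false alarms are cheap relative to misses. Different costs would favor a different threshold, which is exactly why the metric needs to be argued for.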

In order to avoid overfitting you will want to separate some of the data and hold it in reserve for when you evaluate your models – some of these models are expressive enough to memorize all the data!
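To see memorization in action, here is a small sketch on synthetic data (not the crash data): an unrestricted decision tree is fit on one part of the data and scored both on the rows it saw and on rows held in reserve. The variable names and the noise level are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on one feature plus heavy noise.
rng = np.random.RandomState(0)
features = rng.normal(size=(2000, 10))
labels = (features[:, 0] + rng.normal(scale=2.0, size=2000) > 0).astype(int)

x_fit, x_holdout, y_fit, y_holdout = train_test_split(
    features, labels, test_size=0.3, random_state=0)

# An unrestricted tree keeps splitting until it fits the training rows.
tree = DecisionTreeClassifier(random_state=0).fit(x_fit, y_fit)

fit_accuracy = tree.score(x_fit, y_fit)
holdout_accuracy = tree.score(x_holdout, y_holdout)
print('accuracy on rows used for fitting:', fit_accuracy)
print('accuracy on held-out rows:', holdout_accuracy)
```

The gap between the two numbers is the part of the apparent performance that comes from memorizing noise, and only the held-out figure says anything about new data.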

**Which is the best story?**

Of course, data science is more than just gathering data and building models – it’s about telling a story backed up by the data. Do crashes with alcohol involved tend to lead to more serious injuries? When it is late at night, are there more convertibles involved in crashes than other types of vehicles (this one involves looking at a different dataset within the GES data)? Which is the safest seat in a car? And how sure can you be that your findings are statistically significant?
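For a question like the alcohol one, a chi-square test of independence on a table of counts is one common way to check whether a pattern could plausibly be chance. The sketch below uses `scipy.stats.chi2_contingency`; the counts are made up for illustration and are NOT taken from the GES data.

```python
from scipy.stats import chi2_contingency

# Made-up counts, NOT taken from the GES data.
# Rows: alcohol involved (no, yes); columns: injury (minor, serious).
counts = [[900, 100],
          [150, 50]]

chi2, p_value, dof, expected = chi2_contingency(counts)
print('chi-square = {:.2f}, p-value = {:.4g}, dof = {}'.format(
    chi2, p_value, dof))
```

A small p-value suggests that the two variables are associated in the sample. It says nothing about why, so the story still has to rule out other explanations.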

Good luck coming up with a great story!


This post was written by Dallin Akagi and Mark Steadman. Please post any feedback, comments, or questions below or send us an email at <firstname>@datarobot.com.

This post was inspired by the StatLearning MOOC by Stanford.

**About: DataRobot** offers a machine learning platform for data scientists of all skill levels to build and deploy accurate predictive models in a fraction of the time it used to take. The technology addresses the critical shortage of data scientists by changing the speed and economics of predictive analytics.

Original. Reposted with permission.
