Getting Started with Data Science – Python

A great introductory post from DataRobot on getting started with data science in the Python ecosystem, including cleaning data and performing predictive modeling.

By Dallin Akagi and Mark Steadman, DataRobot.

This short tutorial will not only guide you through some basic data analysis methods but will also show you how to implement some of the more sophisticated techniques available today. We will look into traffic accident data from the National Highway Traffic Safety Administration and try to predict fatal accidents using state-of-the-art statistical learning techniques. If you are interested, download the code at the bottom and follow along as we work through a real-world data set. This post is in Python, while a companion post covers the same techniques in R.


First things first

For those of you who are not familiar with Python and some of its most popular libraries for data science, please follow along with this blog post, which will get you set up with an environment similar to the one we will be using. There are instructions for Mac, Linux, and Windows environments, so hopefully we have all the bases covered.

IPython is awesome, as you will come to find out.

Get some data

Being able to play with data requires having the data itself, so let’s take care of that right now. The National Highway Traffic Safety Administration (NHTSA) has some really cool data that they make public. The following code snippet will take care of downloading the data to a new directory, and extracting the files from that zipfile. The zip is 14.9 MB so it might take some time to run – it is worth the wait! This is really cool data.

import os
import zipfile

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

source_url = ''
zip_name = ''
cwd = os.getcwd()
dir_path = os.path.join(cwd, 'GES2012')
zip_path = os.path.join(dir_path, zip_name)

# We'll make a directory for you to play around with,
# then when you're done playing you can just delete the directory
if not os.path.exists(dir_path):
    os.makedirs(dir_path)

# Download the file from GES website if you haven't already
if not os.path.exists(zip_path):
    response = urlopen(source_url)
    with open(zip_path, 'wb') as fh:
        x = response.read()
        fh.write(x)

# Extract all the files from that zipfile
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall(dir_path)

# See what we just unzipped
print(os.listdir(dir_path))
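Before extracting an archive, it can be handy to peek at what it contains. The following is a minimal sketch using only the standard library. It builds a tiny in-memory zip so that it runs without any download; the member names and contents are made up for illustration and are not a claim about the layout of the actual GES archive.

```python
import io
import zipfile

# Build a small in-memory archive so the sketch is self-contained.
# The member names and contents are made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('PERSON.TXT', 'CASENUM\tINJSEV_IM\n1\t0\n')
    z.writestr('ACCIDENT.TXT', 'CASENUM\n1\n')

# Re-open the archive for reading and list its members without extracting.
buffer.seek(0)
with zipfile.ZipFile(buffer, 'r') as z:
    member_names = z.namelist()
    member_sizes = {info.filename: info.file_size for info in z.infolist()}

print(member_names)
print(member_sizes)
```

The same two calls, namelist() and infolist(), should work on a zip file opened from disk by path.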


Load the data into Python

With our data downloaded and readily accessible, we can start to play around and see what we can learn from it. Many of the columns use an encoding that you will need to read the manual to understand, so it might be useful to download that PDF so you can easily refer to it. We will be looking at PERSON.TXT, which contains information at the level of the individuals involved in the accidents.

import pandas as pd
import numpy as np
import sklearn

cwd = os.getcwd()
dir_path  = os.path.join(cwd, 'GES2012')
input_file_path = os.path.join(dir_path, 'PERSON.TXT')

input_data = pd.read_csv(input_file_path, delimiter='\t')
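
To see what that call does without the full download, here is a small sketch that parses a tab-delimited snippet from memory. The three rows and their values are invented for illustration; only the column names INJSEV_IM, INJ_SEV, and MAKE are taken from this post.

```python
import io

import pandas as pd

# A made-up, tab-delimited sample standing in for PERSON.TXT.
sample_text = (
    "INJSEV_IM\tINJ_SEV\tMAKE\n"
    "0\t0\t20\n"
    "4\t4\t12\n"
    "2\t9\t\n"
)

# sep and delimiter are interchangeable names for the same read_csv option.
sample = pd.read_csv(io.StringIO(sample_text), sep='\t')

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # MAKE is read as float because of the empty field
print(sample.head())
```

An empty field is read as NaN, which is why a column of integer codes can come back as floats.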



Clean up the data

One prediction task you might find interesting is predicting whether or not a crash was fatal. The column INJSEV_IM contains imputed values for the severity of the injury, but there is still one value that might complicate analysis – level 6 indicates that the person died prior to the crash.

input_data.INJSEV_IM.value_counts()

0    100840
2     20758
1     19380
3      9738
5      1179
4      1178
6         4
dtype: int64

Fortunately, there are only four of those cases within the dataset, so it is not unreasonable to ignore them during our analysis. However, we will find that a few of the columns in the data have missing values:

# Drop those odd cases
input_data = input_data[input_data.INJSEV_IM != 6]

for column_name in input_data.columns:
    n_nans = input_data[column_name].isnull().sum()
    if n_nans > 0:
        print('%s %d' % (column_name, n_nans))

MAKE 5162
TOW_VEH 5162
IMPACT1 5162
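
The same counts can be obtained without an explicit loop. This is a sketch on a small made-up frame (the values are invented; only the column names come from this post), relying on DataFrame.isnull().sum() returning one count per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for input_data; values are invented for illustration.
toy = pd.DataFrame({
    'MAKE':      [20.0, np.nan, 12.0, np.nan],
    'TOW_VEH':   [0.0, np.nan, 1.0, np.nan],
    'INJSEV_IM': [0, 2, 4, 1],
})

# Count missing values per column, then keep only columns that have any.
nan_counts = toy.isnull().sum()
nan_counts = nan_counts[nan_counts > 0]
print(nan_counts)
```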

For this analysis, we will just drop these rows (they are all the same rows) – but you certainly don’t have to do that. In fact, maybe there is a systematic data entry error that is causing them to be interpreted incorrectly. Regardless of the way you clean up this data, we will most assuredly want to drop the column INJ_SEV, as it is the non-imputed version of INJSEV_IM and is a pretty severe data leak – there are others as well.

print(input_data.shape)
# Copy the filtered rows so that pop() does not act on a slice of input_data
data = input_data[~input_data.MAKE.isnull()].copy()
discarded = data.pop('INJ_SEV')
target = data.pop('INJSEV_IM')
print(data.shape)

(153073, 58)
(147911, 56)
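
The statement that the missing values all fall in the same rows is easy to check. Here is a minimal sketch on invented data: a row is consistent if its missing flags are either all set or all clear. The names toy, toy_discarded, and toy_target are ours, used only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for input_data; values are invented for illustration.
toy = pd.DataFrame({
    'MAKE':      [20.0, np.nan, 12.0, np.nan, 37.0],
    'TOW_VEH':   [0.0, np.nan, 1.0, np.nan, 0.0],
    'IMPACT1':   [12.0, np.nan, 6.0, np.nan, 1.0],
    'INJ_SEV':   [0, 2, 4, 9, 1],
    'INJSEV_IM': [0, 2, 4, 1, 1],
})

# Missing-value flags for the three affected columns.
missing = toy[['MAKE', 'TOW_VEH', 'IMPACT1']].isnull()

# In every row, "any flag set" must agree with "all flags set".
same_rows = bool((missing.any(axis=1) == missing.all(axis=1)).all())
print(same_rows)

# Keep complete rows, then separate the leaky column and the target.
clean = toy[~missing.any(axis=1)].copy()
toy_discarded = clean.pop('INJ_SEV')
toy_target = clean.pop('INJSEV_IM')
print(clean.shape)
```

If same_rows came back False on the real data, dropping rows based on MAKE alone would leave some missing values behind.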

One more preprocessing step we’ll do is to transform the response. If you flip to the manual it shows that category 4 is a fatal injury – so we will encode our target as such.

target = (target == 4).astype('float')
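
For intuition about what this encoding produces, here is a sketch on a handful of invented severity codes. The mean of a 0/1 target is the fraction of positive cases, which is a quick way to see how imbalanced the classes are.

```python
import pandas as pd

# Invented severity codes; 4 stands for a fatal injury, as in the manual.
severity = pd.Series([0, 2, 4, 1, 0, 3, 4, 0])

# Comparing against 4 gives booleans; astype('float') turns them into 0.0/1.0.
fatal = (severity == 4).astype('float')

print(fatal.tolist())
print(fatal.mean())  # fraction of rows flagged as fatal
```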