Getting Started with Data Science – Python
A great introductory post from DataRobot on getting started with data science in the Python ecosystem, including cleaning data and performing predictive modeling.
By Dallin Akagi and Mark Steadman, DataRobot.
This short tutorial will guide you through some basic data analysis methods and also show you how to implement some of the more sophisticated techniques available today. We will look into traffic accident data from the National Highway Traffic Safety Administration and try to predict fatal accidents using state-of-the-art statistical learning techniques. If you are interested, download the code at the bottom and follow along as we work through a real-world data set. This post is in Python, while a companion post covers the same techniques in R.
First things first
For those of you who are not familiar with Python and some of its most popular libraries for data science, please follow along with this blog post, which will get you set up with an environment similar to the one we will be using. There are instructions for Mac, Linux, and Windows environments, so hopefully we have all the bases covered.
IPython is awesome, as you will come to find out.
Get some data
Being able to play with data requires having the data itself, so let’s take care of that right now. The National Highway Traffic Safety Administration (NHTSA) has some really cool data that they make public. The following code snippet will take care of downloading the data to a new directory, and extracting the files from that zipfile. The zip is 14.9 MB so it might take some time to run – it is worth the wait! This is really cool data.
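As a minimal sketch of that download-and-extract step, assuming only the standard library: the helper name download_and_extract, the default directory "data", and the archive filename are hypothetical choices for illustration, and the address of the NHTSA zip is left as a parameter because it is not given here.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, data_dir="data"):
    """Download the zip archive at `url` into `data_dir` and unpack it there.

    Hypothetical helper: the name, default directory, and archive filename
    are illustrative assumptions, not part of the original post.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # create the new directory if needed

    zip_path = data_dir / "archive.zip"
    urllib.request.urlretrieve(url, str(zip_path))  # fetch the archive to disk

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)  # unpack every member next to the zip
        return archive.namelist()     # names of the extracted files
```

Calling download_and_extract with the address of the NHTSA archive should leave the extracted text files, PERSON.TXT among them, in the data directory.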
Load the data into Python
With our data downloaded and readily accessible, we can start to play around and see what we can learn from the data. Many of the columns use an encoding that you will need the manual to understand, so it might be useful to download that PDF so you can refer to it easily. We will be looking at PERSON.TXT, which contains information at the level of the individuals involved in the accidents.
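A hedged sketch of the loading step with pandas follows. It assumes the file is tab-delimited with a header row, which you should confirm against the manual. The helper name load_person is hypothetical, and the in-memory sample, including the AGE column and every value in it, is a made-up stand-in so the example runs without the real file.

```python
import io

import pandas as pd


def load_person(path_or_buffer):
    """Read a person-level table into a DataFrame.

    Assumption: the file is tab-delimited with a header row.
    """
    # low_memory=False makes pandas read the whole file before inferring
    # column types, which avoids mixed-type warnings on wide files.
    return pd.read_csv(path_or_buffer, sep="\t", low_memory=False)


# Toy stand-in for PERSON.TXT (all values are invented for illustration).
sample = io.StringIO(
    "AGE\tINJ_SEV\tINJSEV_IM\n"
    "34\t4\t4\n"
    "27\t0\t0\n"
)
person = load_person(sample)
print(person.shape)
print(person.head())
```

With the real data, the same helper would be pointed at the extracted PERSON.TXT path instead of the in-memory sample.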
Clean up the data
One prediction task you might find interesting is predicting whether or not a crash was fatal. The column INJSEV_IM contains imputed values for the severity of the injury, but there is still one value that might complicate analysis – level 6 indicates that the person died prior to the crash.
Fortunately, there are only four of those cases within the dataset, so it is not unreasonable to ignore them during our analysis. However, we will find that a few of the columns in the data have missing values:
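One way to drop the level 6 rows and count the missing values per column, sketched with pandas on a toy frame. The AGE column and all of the values are invented for illustration. With the real data, the frame loaded from PERSON.TXT would take its place.

```python
import pandas as pd

# Toy stand-in for the person-level data (values are invented).
person = pd.DataFrame({
    "INJSEV_IM": [0, 4, 6, 2, 4],
    "AGE": [34.0, 51.0, 40.0, float("nan"), 19.0],  # hypothetical column
})

# Ignore the "died prior to crash" cases (code 6 in INJSEV_IM).
person = person[person["INJSEV_IM"] != 6]

# Count missing values per column and show only the affected columns.
missing_per_column = person.isnull().sum()
print(missing_per_column[missing_per_column > 0])
```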
For this analysis, we will just drop these rows (the missing values all fall in the same rows) – but you certainly don’t have to do that. In fact, maybe there is a systematic data entry error that is causing them to be interpreted incorrectly. Regardless of how you clean up this data, we will most assuredly want to drop the column INJ_SEV, as it is the non-imputed version of INJSEV_IM and is a pretty severe data leak – there are others as well.
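A small sketch of this cleanup, again on a toy frame with invented values and a hypothetical AGE column: rows with any missing value are discarded, and INJ_SEV is removed so that the non-imputed severity cannot leak into the features.

```python
import pandas as pd

# Toy stand-in for the person-level data (values are invented).
person = pd.DataFrame({
    "INJ_SEV": [0, 4, 9, 2],
    "INJSEV_IM": [0, 4, 3, 2],
    "AGE": [34.0, 51.0, float("nan"), 19.0],  # hypothetical column
})

# Drop every row that has at least one missing value.
cleaned = person.dropna()

# Remove the non-imputed severity column to avoid the data leak.
cleaned = cleaned.drop(columns=["INJ_SEV"])
print(cleaned)
```

Any other leaking columns you identify in the manual can be added to the same list passed to drop.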
One more preprocessing step we’ll do is to transform the response. If you flip to the manual, it shows that category 4 is a fatal injury – so we will encode our target as such.
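The encoding can be sketched as a binary target that is 1 when INJSEV_IM equals 4 and 0 otherwise. The names FATAL_CODE, X, and y are illustrative choices, and the toy frame with its AGE column is invented for the example.

```python
import pandas as pd

# Toy stand-in for the cleaned person-level data (values are invented).
person = pd.DataFrame({
    "INJSEV_IM": [0, 4, 2, 4, 3],
    "AGE": [34, 51, 27, 19, 62],  # hypothetical feature column
})

FATAL_CODE = 4  # category 4 is a fatal injury, per the manual

# Binary response: 1 for a fatal injury, 0 for everything else.
y = (person["INJSEV_IM"] == FATAL_CODE).astype(int)

# Features: everything except the column the target was derived from.
X = person.drop(columns=["INJSEV_IM"])
print(y.value_counts())
```

Removing INJSEV_IM from the features matters for the same reason as dropping INJ_SEV: leaving it in would hand the model the answer.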