Getting Started with Data Science – Python
A great introductory post from DataRobot on getting started with data science in the Python ecosystem, including cleaning data and performing predictive modeling.
By Dallin Akagi and Mark Steadman, DataRobot.
This short tutorial will not only guide you through some basic data analysis methods but will also show you how to implement some of the more sophisticated techniques available today. We will look into traffic accident data from the National Highway Traffic Safety Administration and try to predict fatal accidents using state-of-the-art statistical learning techniques. If you are interested, download the code at the bottom and follow along as we work through a real-world data set. This post is in Python, while a companion post covers the same techniques in R.
First things first
For those of you who are not familiar with Python and some of its most popular libraries for data science, please follow along with this blog post, which will get you set up with an environment similar to the one we will be using. There are instructions for Mac, Linux, and Windows environments, so hopefully we have all the bases covered.
IPython is awesome, as you will come to find out.
Get some data
Being able to play with data requires having the data itself, so let's take care of that right now. The National Highway Traffic Safety Administration (NHTSA) has some really cool data that they make public. The following code snippet will take care of downloading the data to a new directory and extracting the files from that zipfile. The zip is 14.9 MB, so it might take some time to run – it is worth the wait! This is really cool data.
import zipfile
import urllib2
import os

source_url = 'ftp://ftp.nhtsa.dot.gov/GES/GES12/GES12_Flatfile.zip'
zip_name = 'GES12_Flatfile.zip'
cwd = os.getcwd()
dir_path = os.path.join(cwd, 'GES2012')
zip_path = os.path.join(dir_path, zip_name)

# We'll make a directory for you to play around with,
# then when you're done playing you can just delete the directory
if not os.path.exists(dir_path):
    os.makedirs(dir_path)

# Download the file from the GES website if you haven't already
if not os.path.exists(zip_path):
    response = urllib2.urlopen(source_url)
    with open(zip_path, 'wb') as fh:
        x = response.read()
        fh.write(x)

# Extract all the files from that zipfile
with zipfile.ZipFile(os.path.join(dir_path, zip_name), 'r') as z:
    z.extractall(dir_path)

# See what we just unzipped
os.listdir(dir_path)
['VIOLATN.TXT', 'DRIMPAIR.TXT', 'VEHICLE.TXT', 'PERSON.TXT', 'VSOE.TXT', 'PARKWORK.TXT', '2012GESFlatFileTXT.sas', 'GES12_Flatfile.zip', 'NMPRIOR.TXT', 'VEVENT.TXT', 'DISTRACT.TXT', 'CEVENT.TXT', 'DAMAGE.TXT', 'ACCIDENT.TXT', 'SAFETYEQ.TXT', 'VISION.TXT', 'NMIMPAIR.TXT', 'FACTOR.TXT', 'MANEUVER.TXT', 'NMCRASH.TXT']
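Note that the snippet above (like the rest of this post) is written for Python 2, where urllib2 is available. If you happen to be working in Python 3, that module was folded into urllib.request; the sketch below is a rough equivalent of the same download-and-extract step under that assumption, not part of the original tutorial:

import os
import zipfile
import urllib.request  # Python 3 home of what urllib2 provided in Python 2

source_url = 'ftp://ftp.nhtsa.dot.gov/GES/GES12/GES12_Flatfile.zip'
zip_name = 'GES12_Flatfile.zip'
dir_path = os.path.join(os.getcwd(), 'GES2012')
zip_path = os.path.join(dir_path, zip_name)

# Make a scratch directory for the data, just like before
if not os.path.exists(dir_path):
    os.makedirs(dir_path)

# Download the zip only if we don't already have it
if not os.path.exists(zip_path):
    response = urllib.request.urlopen(source_url)
    with open(zip_path, 'wb') as fh:
        fh.write(response.read())

# Extract everything into the GES2012 directory
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall(dir_path)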
Load the data into Python
With our data downloaded and readily accessible, we can start to play around and see what we can learn from it. Many of the columns use an encoding that you will need to read the manual to understand, so it might be useful to download that PDF so you can easily refer to it. We will be looking at PERSON.TXT, which contains information at the level of the individuals involved in the accidents.
import pandas as pd
import numpy as np
import sklearn

cwd = os.getcwd()
dir_path = os.path.join(cwd, 'GES2012')
input_file_path = os.path.join(dir_path, 'PERSON.TXT')
input_data = pd.read_csv(input_file_path, delimiter='\t')
sorted(input_data.columns)
['AGE', 'AGE_IM', 'AIR_BAG', 'ALC_RES', 'ALC_STATUS', 'ATST_TYP', 'BODY_TYP', 'CASENUM', 'DRINKING', 'DRUGRES1', 'DRUGRES2', 'DRUGRES3', 'DRUGS', 'DRUGTST1', 'DRUGTST2', 'DRUGTST3', 'DSTATUS', 'EJECTION', 'EJECT_IM', 'EMER_USE', 'FIRE_EXP', 'HARM_EV', 'HOSPITAL', 'HOUR', 'IMPACT1', 'INJSEV_IM', 'INJ_SEV', 'LOCATION', 'MAKE', 'MAN_COLL', 'MINUTE', 'MOD_YEAR', 'MONTH', 'PERALCH_IM', 'PER_NO', 'PER_TYP', 'PJ', 'PSU', 'PSUSTRAT', 'P_SF1', 'P_SF2', 'P_SF3', 'REGION', 'REST_MIS', 'REST_USE', 'ROLLOVER', 'SCH_BUS', 'SEAT_IM', 'SEAT_POS', 'SEX', 'SEX_IM', 'SPEC_USE', 'STRATUM', 'STR_VEH', 'TOW_VEH', 'VEH_NO', 'VE_FORMS', 'WEIGHT']
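To see why the manual matters, it helps to peek at a few of these columns before modeling – most of them are integer codes whose meanings are only defined in that PDF. The snippet below is just an illustrative sketch; the particular columns inspected here are our own picks, not part of the original walkthrough.

# Get a feel for the layout of the table
print input_data.head()

# Columns such as SEX, REST_USE, and AIR_BAG hold numeric codes;
# value_counts() shows which codes actually occur, and the manual
# explains what each code means
for col in ['SEX', 'REST_USE', 'AIR_BAG']:
    print input_data[col].value_counts()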
Clean up the data
One prediction task you might find interesting is predicting whether or not a crash was fatal. The column INJSEV_IM contains imputed values for the severity of the injury, but there is still one value that might complicate analysis – level 6 indicates that the person died prior to the crash.
input_data.INJSEV_IM.value_counts()
0    100840
2     20758
1     19380
3      9738
5      1179
4      1178
6         4
dtype: int64
Fortunately, there are only four of those cases within the dataset, so it is not unreasonable to ignore them during our analysis. However, we will find that a few of the columns in the data have missing values:
# Drop those odd cases
input_data = input_data[input_data.INJSEV_IM != 6]

for column_name in input_data.columns:
    n_nans = input_data[column_name].isnull().sum()
    if n_nans > 0:
        print column_name, n_nans
MAKE 5162
BODY_TYP 5162
MOD_YEAR 5162
TOW_VEH 5162
SPEC_USE 5162
EMER_USE 5162
ROLLOVER 5162
IMPACT1 5162
FIRE_EXP 5162
For this analysis, we will just drop these rows (they are all the same rows) – but you certainly don't have to do that; a sketch of one alternative follows the next code block. In fact, maybe there is a systematic data entry error that is causing them to be interpreted incorrectly. Regardless of how you clean up this data, we will most assuredly want to drop the column INJ_SEV, as it is the non-imputed version of INJSEV_IM and is a pretty severe data leak – there are others as well.
print input_data.shape

data = input_data[~input_data.MAKE.isnull()]
discarded = data.pop('INJ_SEV')
target = data.pop('INJSEV_IM')

print data.shape
(153073, 58)
(147911, 56)
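If you would rather keep those 5,162 rows than drop them, one simple alternative is to fill the missing vehicle-related fields with a sentinel value before popping the targets. This is just a sketch under our own assumptions (the -1 sentinel is an arbitrary choice, not a code from the manual):

# Alternative to dropping rows: keep them and flag the missing
# vehicle-related fields with a sentinel the model can learn from
data_alt = input_data.copy()
vehicle_cols = ['MAKE', 'BODY_TYP', 'MOD_YEAR', 'TOW_VEH', 'SPEC_USE',
                'EMER_USE', 'ROLLOVER', 'IMPACT1', 'FIRE_EXP']
data_alt[vehicle_cols] = data_alt[vehicle_cols].fillna(-1)

# The leaky columns still need to go, exactly as before
discarded_alt = data_alt.pop('INJ_SEV')
target_alt = data_alt.pop('INJSEV_IM')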
One more preprocessing step we'll do is to transform the response. If you flip through the manual, you will see that category 4 is a fatal injury – so we will encode our target as such.
target = (target == 4).astype('float')
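As a quick sanity check (our own aside, not part of the original analysis), you can confirm the encoding and see how rare fatal injuries are in this sample – worth keeping in mind when you later judge any classifier trained on it:

# target is now 1.0 for a fatal injury and 0.0 otherwise
print target.value_counts()

# The mean of a 0/1 target is simply the fraction of fatal cases
print target.mean()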