Would You Survive the Titanic? A Guide to Machine Learning in Python Part 2
This is part 2 of a 3 part introductory series on machine learning in Python, using the Titanic dataset.
Patrick Triest, SocialCops.
Editor's note: This is the second part of a 3 part introductory series on machine learning with Python. Catch up with yesterday's post first if you need.
Why Machine Learning?
We can draw some fairly straightforward conclusions from this data: Being a woman, being in 1st class, and being a child were all factors that could boost your chances of survival during this disaster.
Let’s say we wanted to write a program to predict whether a given passenger would survive the disaster. This could be done through an elaborate system of nested if-else statements with some sort of weighted scoring system, but such a program would be long, tedious to write, difficult to generalize, and would require extensive fine tuning.
This is where machine learning comes in: we will build a program that learns from the sample data in order to predict whether a given passenger would survive.
Preparing The Data
Before we can feed our dataset into a machine learning algorithm, we have to remove missing values and split it into training and test sets.
If we perform a count of each column, we will see that much of the data on certain fields is missing. Most machine learning algorithms will have a difficult time handling missing values, so we will need to make sure that each row has a value for each column.
titanic_df.count()
pclass 1309
survived 1309
name 1309
sex 1309
age 1046
sibsp 1309
parch 1309
ticket 1309
fare 1308
cabin 295
embarked 1307
boat 486
body 121
home.dest 745
dtype: int64
titanic_df = titanic_df.drop(['body','cabin','boat'], axis=1)
titanic_df["home.dest"] = titanic_df["home.dest"].fillna("NA")
titanic_df = titanic_df.dropna()
titanic_df.count()
pclass 1043
survived 1043
name 1043
sex 1043
age 1043
sibsp 1043
parch 1043
ticket 1043
fare 1043
embarked 1043
home.dest 1043
dtype: int64
Most of the rows are missing values for “boat” and “cabin”, so we will remove these columns from the data frame. A large number of rows are also missing the “home.dest” field; here we fill the missing values with “NA”. A significant number of rows are also missing an age value. We have seen above that age could have a significant effect on survival chances, so we will have to drop all of the rows that are missing an age value. When we run the count command again, we can see that all remaining columns now contain the same number of values.
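As a minimal sketch of these cleaning steps, here is the same drop/fillna/dropna sequence applied to a small toy DataFrame (the column names mimic the Titanic data, but the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the missing-value pattern described above
df = pd.DataFrame({
    "age":       [22.0, np.nan, 38.0, 54.0],
    "cabin":     [np.nan, "C85", np.nan, np.nan],
    "home.dest": ["NY", np.nan, "London", np.nan],
})

df = df.drop(["cabin"], axis=1)                 # mostly-empty column: remove it entirely
df["home.dest"] = df["home.dest"].fillna("NA")  # fill missing destinations with a placeholder
df = df.dropna()                                # drop rows still missing values (here, age)

print(df.count())  # every remaining column now has the same count
```

After these steps, the one row with a missing age is gone, and the counts of the surviving columns match, just as they do for the real dataset above.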
Now we need to format the remaining data in a way that our machine learning algorithms will accept.
def preprocess_titanic_df(df):
    processed_df = df.copy()
    le = preprocessing.LabelEncoder()
    processed_df.sex = le.fit_transform(processed_df.sex)
    processed_df.embarked = le.fit_transform(processed_df.embarked)
    processed_df = processed_df.drop(['name','ticket','home.dest'], axis=1)
    return processed_df

processed_df = preprocess_titanic_df(titanic_df)
The “sex” and “embarked” fields are both string values that correspond to categories (e.g. “male” and “female”), so we will run each through a preprocessor. This preprocessor will convert these strings into integer keys, making it easier for the classification algorithms to find patterns. For instance, “female” and “male” will be converted to 0 and 1 respectively, since the encoder assigns keys in sorted label order. The “name”, “ticket”, and “home.dest” columns consist of non-categorical string values, and as such are difficult to use in a classification algorithm, so we will drop them from the dataset.
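To see exactly what the LabelEncoder does, here is a small standalone sketch on a hand-written list of labels (not the real column):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
encoded = le.fit_transform(["male", "female", "female", "male"])

# The encoder sorts the unique labels alphabetically and assigns
# each one an integer key in that order.
print(list(le.classes_))  # ['female', 'male']
print(list(encoded))      # [1, 0, 0, 1]
```

Because the keys follow sorted order, “female” maps to 0 and “male” to 1; the same logic applies to the “embarked” port codes.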
X = processed_df.drop(['survived'], axis=1).values
y = processed_df['survived'].values

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
Next we separate the dataset into two arrays: “X”, containing all of the values for each row besides “survived”, and “y”, containing only the “survived” value for that row. The classification algorithms will compare the attribute values of “X” to the corresponding values of “y” in order to detect patterns in how different attribute values tend to affect the survival of a passenger.
Finally we break the “X” and “y” arrays into two parts each: a training set and a testing set. We will feed the training set into the classification algorithm in order to form a trained model. Once the model is formed, we will use it to classify the testing set, allowing us to determine the accuracy of the model. Here we have made an 80/20 split, such that 80% of the dataset will be used for training and 20% will be used for testing.
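The split can be verified on dummy data. Note that in recent versions of scikit-learn the `cross_validation` module has been removed, and `train_test_split` lives in `sklearn.model_selection` instead; this sketch uses the newer import, with made-up arrays standing in for the Titanic features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows with 3 features each, plus a binary label per row
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

# test_size=0.2 reserves 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

The `random_state` argument fixes the shuffle so the split is reproducible from run to run.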