Would You Survive the Titanic? A Guide to Machine Learning in Python Part 2

This is part 2 of a 3 part introductory series on machine learning in Python, using the Titanic dataset.



Patrick Triest, SocialCops.

Editor's note: This is the second part of a 3 part introductory series on machine learning with Python. Catch up with yesterday's post first if you need.

Why Machine Learning?

We can draw some fairly straightforward conclusions from this data: Being a woman, being in 1st class, and being a child were all factors that could boost your chances of survival during this disaster.

Let’s say we wanted to write a program to predict whether a given passenger would survive the disaster. This could be done through an elaborate system of nested if-else statements with some sort of weighted scoring system, but such a program would be long, tedious to write, difficult to generalize, and would require extensive fine tuning.

This is where machine learning comes in: we will build a program that learns from the sample data in order to predict whether a given passenger would survive.

Preparing The Data

Before we can feed our dataset into a machine learning algorithm, we have to remove missing values and split it into training and test sets.

If we perform a count of each column, we will see that much of the data on certain fields is missing. Most machine learning algorithms will have a difficult time handling missing values, so we will need to make sure that each row has a value for each column.

titanic_df.count()


pclass       1309
survived     1309
name         1309
sex          1309
age          1046
sibsp        1309
parch        1309
ticket       1309
fare         1308
cabin         295
embarked     1307
boat          486
body          121
home.dest     745
dtype: int64


titanic_df = titanic_df.drop(['body','cabin','boat'], axis=1)
titanic_df["home.dest"] = titanic_df["home.dest"].fillna("NA")
titanic_df = titanic_df.dropna()
titanic_df.count()


pclass       1043
survived     1043
name         1043
sex          1043
age          1043
sibsp        1043
parch        1043
ticket       1043
fare         1043
embarked     1043
home.dest    1043
dtype: int64

Click here for the gist.

Most of the rows are missing values for “boat” and “cabin”, so we will remove these columns from the data frame. A large number of rows are also missing the “home.dest” field; here we fill the missing values with “NA”. A significant number rows are also missing an age value. We have seen above that age could have a significant effect on survival chances, so we will have to drop all of rows that are missing an age value. When we run the count command again we can see that all remaining columns now contain the same number of values.

Now we need to format the remaining data in a way that our machine learning algorithms will accept.

def preprocess_titanic_df(df):
    processed_df = df.copy()
    le = preprocessing.LabelEncoder()
    processed_df.sex = le.fit_transform(processed_df.sex)
    processed_df.embarked = le.fit_transform(processed_df.embarked)
    processed_df = processed_df.drop(['name','ticket','home.dest'],axis=1)
    return processed_df

processed_df = preprocess_titanic_df(titanic_df)

Click here for the gist.

The “sex” and “embarked” fields are both string values that correspond to categories(i.e “Male” and “Female”) so we will run each through a preprocessor. This preprocessor will convert these strings into integer keys, making it easier for the classification algorithms to find patterns. For instance, “Male” and “Female” will be converted to 0 and 1 respectively. The “name”, “ticket”, and “home.dest” columns consist of non-categorical string values, and as such are difficult to use in a classification algorithm, so we will drop them from the dataset.

X = processed_df.drop(['survived'], axis=1).values
y = processed_df['survived'].values

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)

Click here for the gist.

Next we separate the dataset into two arrays, “X” containing all of the values for each row besides “survived”, and “y” containing only the “survived” value for that row. The classification algorithms will compare the attribute values of “X” to the corresponding values of “y” in order to detect patterns in how different attributes values tend to affect the survival of a passenger.

Finally we break the “X” and “y” array into two parts each - a training set and a testing set. We will feed the training set into the classification algorithm in order to form a trained model. Once the model is formed, we will use it to classify the testing set, allowing us to determine the accuracy of the model. Here we have have made a 20/80 split, such that 80% of the dataset will be used for training and 20% will be used for testing.