KDnuggets Home » News » 2016 » Jul » Tutorials, Overviews » Would You Survive the Titanic? A Guide to Machine Learning in Python Part 2 ( 16:n27 )

# Would You Survive the Titanic? A Guide to Machine Learning in Python Part 2

This is part 2 of a 3 part introductory series on machine learning in Python, using the Titanic dataset.

Patrick Triest, SocialCops.

Editor's note: This is the second part of a 3-part introductory series on machine learning with Python. Catch up with yesterday's post first if you need to.

## Why Machine Learning?

We can draw some fairly straightforward conclusions from this data: Being a woman, being in 1st class, and being a child were all factors that could boost your chances of survival during this disaster.

Let’s say we wanted to write a program to predict whether a given passenger would survive the disaster. This could be done through an elaborate system of nested if-else statements with some sort of weighted scoring system, but such a program would be long, tedious to write, difficult to generalize, and would require extensive fine tuning.

This is where machine learning comes in: we will build a program that learns from the sample data in order to predict whether a given passenger would survive.

## Preparing The Data

Before we can feed our dataset into a machine learning algorithm, we have to remove missing values and split it into training and test sets.

If we perform a count of each column, we will see that much of the data on certain fields is missing. Most machine learning algorithms will have a difficult time handling missing values, so we will need to make sure that each row has a value for each column.

```
titanic_df.count()
```

```
pclass       1309
survived     1309
name         1309
sex          1309
age          1046
sibsp        1309
parch        1309
ticket       1309
fare         1308
cabin         295
embarked     1307
boat          486
body          121
home.dest     745
dtype: int64
```

```
titanic_df = titanic_df.drop(['body','cabin','boat'], axis=1)
titanic_df["home.dest"] = titanic_df["home.dest"].fillna("NA")
titanic_df = titanic_df.dropna()
titanic_df.count()
```

```
pclass       1043
survived     1043
name         1043
sex          1043
age          1043
sibsp        1043
parch        1043
ticket       1043
fare         1043
embarked     1043
home.dest    1043
dtype: int64
```

Most of the rows are missing values for “boat” and “cabin”, so we will remove these columns from the data frame. A large number of rows are also missing the “home.dest” field; here we fill the missing values with “NA”. A significant number of rows are also missing an age value. We have seen above that age could have a significant effect on survival chances, so we will have to drop all rows that are missing an age value. When we run the count command again, we can see that all remaining columns now contain the same number of values.
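As an aside, dropping rows is not the only way to handle the missing ages: one could instead impute them, e.g. with the column median. A minimal sketch with a toy DataFrame (not the approach taken in this article):

```python
import pandas as pd

# Hypothetical illustration: fill missing ages with the column median
# instead of dropping those rows with dropna().
df = pd.DataFrame({"age": [22.0, None, 38.0, None, 26.0]})
df["age"] = df["age"].fillna(df["age"].median())
print(df["age"].tolist())  # [22.0, 26.0, 38.0, 26.0, 26.0]
```

Imputation keeps more training rows, at the cost of injecting values that were never observed.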

Now we need to format the remaining data in a way that our machine learning algorithms will accept.

```
from sklearn import preprocessing

def preprocess_titanic_df(df):
    processed_df = df.copy()
    # Encode the categorical string columns ("sex", "embarked") as integers
    le = preprocessing.LabelEncoder()
    processed_df.sex = le.fit_transform(processed_df.sex)
    processed_df.embarked = le.fit_transform(processed_df.embarked)
    # Drop free-text columns that are hard to use for classification
    processed_df = processed_df.drop(['name','ticket','home.dest'], axis=1)
    return processed_df

processed_df = preprocess_titanic_df(titanic_df)
```

The “sex” and “embarked” fields are both string values that correspond to categories (e.g. “male” and “female”), so we will run each through a preprocessor. This preprocessor will convert the strings into integer keys, making it easier for the classification algorithms to find patterns. For instance, “female” and “male” will be converted to 0 and 1, respectively. The “name”, “ticket”, and “home.dest” columns consist of non-categorical string values, and as such are difficult to use in a classification algorithm, so we will drop them from the dataset.
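To see the encoding concretely, here is a small self-contained demo of `LabelEncoder` on a toy “sex”-like column; the classes are sorted alphabetically when fit, which is why “female” maps to 0 and “male” to 1:

```python
from sklearn import preprocessing

# LabelEncoder sorts the unique values alphabetically on fit,
# so "female" -> 0 and "male" -> 1.
le = preprocessing.LabelEncoder()
encoded = le.fit_transform(["male", "female", "female", "male"])
print(le.classes_.tolist())  # ['female', 'male']
print(encoded.tolist())      # [1, 0, 0, 1]
```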

```
from sklearn.model_selection import train_test_split

X = processed_df.drop(['survived'], axis=1).values
y = processed_df['survived'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
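The split holds out 20% of the rows as a test set. A minimal sanity check of that behavior with toy data (hypothetical arrays, not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# With test_size=0.2, 20% of the 50 rows are held out for testing.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```

Holding out a test set that the model never sees during training is what lets us estimate how well the classifier generalizes.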