Would You Survive the Titanic? A Guide to Machine Learning in Python Part 3

This is the final part of a three-part introductory series on machine learning in Python, using the Titanic dataset.

These Are Not Just Data Points, They’re People

Given that the accuracy of all of our models maxes out at around 80%, it is worth looking at the specific passengers whom these classification algorithms get wrong.

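# titanic_df, preprocess_titanic_df and tf_clf_dnn are assumed to be
# defined as in the earlier parts of this series.

# Hold out the first 20 listed passengers of each class as the test set
passengers_set_1 = titanic_df[titanic_df.pclass == 1].iloc[:20, :].copy()
passengers_set_2 = titanic_df[titanic_df.pclass == 2].iloc[:20, :].copy()
passengers_set_3 = titanic_df[titanic_df.pclass == 3].iloc[:20, :].copy()
passenger_set = pd.concat([passengers_set_1, passengers_set_2, passengers_set_3])
testing_set = preprocess_titanic_df(passenger_set)

# Train on everything else: rows present in both frames are dropped entirely
training_set = pd.concat([titanic_df, passenger_set]).drop_duplicates(keep=False)
training_set = preprocess_titanic_df(training_set)

# Separate the features from the survival label
X_train = training_set.drop(['survived'], axis=1).values
y_train = training_set['survived'].values
X_test = testing_set.drop(['survived'], axis=1).values
y_test = testing_set['survived'].values

# Fit the neural network, then score it on the held-out passengers
tf_clf_dnn.fit(X_train, y_train)
tf_clf_dnn.score(X_test, y_test)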

Step #100, epoch #25, avg. train loss: 0.64422
Step #200, epoch #50, avg. train loss: 0.60906
Step #300, epoch #75, avg. train loss: 0.59641
Step #400, epoch #100, avg. train loss: 0.58298
Step #500, epoch #125, avg. train loss: 0.56000
Step #600, epoch #150, avg. train loss: 0.53058
Step #700, epoch #175, avg. train loss: 0.50669
Step #800, epoch #200, avg. train loss: 0.48891
Step #900, epoch #225, avg. train loss: 0.47792
Step #1000, epoch #250, avg. train loss: 0.46642


The above code forms a test dataset of the first 20 listed passengers in each class, and trains a deep neural network on the remaining data.
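The split described above can also be written with a group-wise `head` and an index-based `drop`. What follows is a minimal sketch on a small made-up DataFrame, not the original gist: the helper name `split_first_n_per_class` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def split_first_n_per_class(df, n, class_col="pclass"):
    """Hold out the first n rows of each class as a test set.

    Hypothetical helper for illustration; not part of the original code.
    """
    # head(n) on a groupby keeps the first n rows of every group
    # and preserves the original index labels.
    test = df.groupby(class_col).head(n)
    # Dropping those index labels leaves every other row for training.
    train = df.drop(test.index)
    return train, test


# Toy stand-in for titanic_df (made-up values, not real passengers).
toy_df = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "survived": [1, 0, 1, 1, 0, 0, 0, 0, 1],
})

train, test = split_first_n_per_class(toy_df, n=2)
print(len(train), len(test))
```

Dropping by index label relies on the DataFrame having a unique index, but it does not depend on every row being distinct.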

Once the model is trained, we can use it to predict the survival of the passengers in the test dataset, and compare these predictions with each passenger's known outcome in the original dataset.

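# Predict survival for the held-out passengers
prediction = tf_clf_dnn.predict(X_test)
# Show only the passengers whose known outcome differs from the prediction
passenger_set[passenger_set.survived != prediction]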

Demonstration analysis

The above table shows all of the passengers in our test dataset whose survival (or lack thereof) was incorrectly classified by the neural network model.
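To see in which direction the model was wrong, it can help to keep the prediction next to the known outcome. The following is a small sketch using made-up stand-ins for `passenger_set` and `prediction`; the names `toy_passengers`, `toy_prediction` and the `predicted` column are assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for passenger_set and prediction (made-up values).
toy_passengers = pd.DataFrame({
    "name":     ["A", "B", "C", "D", "E"],
    "survived": [0, 1, 0, 1, 1],
})
toy_prediction = np.array([1, 1, 0, 0, 1])

# Store the model's guess alongside the known outcome.
results = toy_passengers.assign(predicted=toy_prediction)

# Rows where the guess and the outcome disagree.
wrong = results[results.survived != results.predicted]

# Split the errors by direction.
predicted_survive_but_died = wrong[wrong.predicted == 1]
predicted_die_but_survived = wrong[wrong.predicted == 0]

# Fraction of toy passengers classified correctly.
accuracy = (results.survived == results.predicted).mean()

print(wrong)
print("accuracy on this toy set:", accuracy)
```

In the stories that follow, the Allisons and John Jacob Astor fall into the first group (predicted to survive, but perished), while Olaus Abelseth falls into the second.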

Sometimes when you are dealing with datasets like this, the human side of the story can get lost beneath the complicated math and statistical analysis. By examining passengers for whom our classification model was incorrect, we can begin to uncover some of the most fascinating, and sometimes tragic, stories of humans defying the odds.

For instance, the first three incorrectly classified passengers are all members of the Allison family, who perished even though the model predicted that they would survive. These first-class passengers were very wealthy, as evidenced by their far-above-average ticket prices. For Betsy (25) and Loraine (2) in particular, not surviving is very surprising, considering that we found earlier that over 96% of first-class women lived through the disaster.

Allison family

So what happened? A surprising amount of information on each Titanic passenger is available online; it turns out that the Allison family were unable to find their youngest son, Trevor, and were unwilling to evacuate the ship without him. Tragically, Trevor was already safe in a lifeboat with his nurse, and was the only member of the Allison family to survive the sinking.

John Jacob Astor

Another interesting misclassification is John Jacob Astor, who perished in the disaster even though the model predicted he would survive. Astor was the wealthiest person on the Titanic, an impressive feat on a ship full of multimillionaire industrialists, railroad tycoons, and aristocrats. Given his immense wealth and influence, which the model may have deduced from his ticket fare (valued at over $35,000 in 2016), it seemed likely that he would be among the 35% of men in first class to survive. However, this was not the case: although his pregnant wife survived, John Jacob Astor’s body was recovered a week later, along with a gold watch, a diamond ring with three stones, and no less than $92,481 (2016 value) in cash.

Olaus Jorgensen Abelseth

On the other end of the spectrum is Olaus Jorgensen Abelseth, a 25-year-old Norwegian sailor. Abelseth, as a man in third class, was not expected to survive by our classifier. Once the ship sank, however, he was able to stay alive by swimming for 20 minutes in the frigid North Atlantic water before joining other survivors on a waterlogged collapsible boat and rowing through the night. Abelseth got married three years later, settled down as a farmer in North Dakota, had four kids, and died in 1980 at the age of 94.

When we looked into the data points for which our model was wrong, we uncovered incredible stories of human nature driving people to defy their logical fate. It is important to never lose sight of the human element when analyzing this type of dataset. This principle will be especially important going forward, as machine learning is increasingly applied to human datasets by organizations such as insurance companies, big banks, and law enforcement agencies.

Where next?

So there you have it, a primer for data analysis and machine learning in Python. From here you can fine-tune the machine learning algorithms to achieve better accuracy on this dataset, design your own neural networks using TensorFlow, discover more fascinating stories of passengers whose survival does not match the model, and apply all of these techniques to any other dataset (check out this Game of Thrones dataset). When it comes to machine learning, the possibilities are endless and the opportunities are titanic.

Bio: Patrick Triest is a 23-year-old Android Developer / IoT Engineer / Data Scientist / wannabe pioneer, originally from Boston and now working at SocialCops. He’s addicted to learning, and sometimes after figuring out something particularly cool he gets really excited and writes about it.

Original. Reposted with permission.