Building a Data Science Portfolio: Machine Learning Project Part 3

The final installment of this comprehensive overview on building an end-to-end data science portfolio project focuses on bringing it all together, and concludes the project quite nicely.

By Vik Paruchuri, Dataquest.

Editor's note: This post picks up where yesterday's left off. You may want to get caught up first!

Pulling everything together

We’re almost ready to pull everything together; we just need to add a bit more code. In the code below, we:

  • Define a function to read in the acquisition data.
  • Define a function to write the processed data to processed/train.csv.
  • If this file is called from the command line with python:
    • Read in the acquisition data.
    • Compute the counts for the performance data, and assign them to counts.
    • Annotate the acquisition DataFrame.
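    • Write the acquisition DataFrame to train.csv.

# Assumes os, pandas (as pd), and settings are already imported at the top of this file.
def read():
    # Load the combined acquisition data from the processed directory.
    acquisition = pd.read_csv(os.path.join(settings.PROCESSED_DIR, "Acquisition.txt"), sep="|")
    return acquisition

def write(acquisition):
    # Save the annotated data as the training set, leaving out the index column.
    acquisition.to_csv(os.path.join(settings.PROCESSED_DIR, "train.csv"), index=False)

if __name__ == "__main__":
    acquisition = read()
    counts = count_performance_rows()
    acquisition = annotate(acquisition, counts)
    write(acquisition)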

Once you’re done updating the file, make sure to run it with python to generate the train.csv file. You can find the complete file here.

The folder should now look like this:

├── data
│   ├── Acquisition_2012Q1.txt
│   ├── Acquisition_2012Q2.txt
│   ├── Performance_2012Q1.txt
│   ├── Performance_2012Q2.txt
│   └── ...
├── processed
│   ├── Acquisition.txt
│   ├── Performance.txt
│   └── train.csv
├── .gitignore
├── requirements.txt

Finding an error metric

We’re done with generating our training dataset, and now we’ll just need to do the final step, generating predictions. We’ll need to figure out an error metric, as well as how we want to evaluate our data. In this case, there are many more loans that aren’t foreclosed on than are, so typical accuracy measures don’t make much sense.

If we read in the training data, and check the counts in the foreclosure_status column, here’s what we get:

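import os

import pandas as pd
import settings

train = pd.read_csv(os.path.join(settings.PROCESSED_DIR, "train.csv"))
# Count how many loans fall into each foreclosure_status value.
train["foreclosure_status"].value_counts()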

False    4635982
True        1585
Name: foreclosure_status, dtype: int64

Since so few of the loans were foreclosed on, just checking the percentage of labels that were correctly predicted means that a machine learning model could predict False for every row and still get a very high accuracy. Instead, we’ll want to use a metric that takes the class imbalance into account, and ensures that we predict foreclosures accurately. We don’t want too many false positives, where we predict that a loan will be foreclosed on even though it won’t, or too many false negatives, where we predict that a loan won’t be foreclosed on, but it is. Of these two, false negatives are more costly for Fannie Mae, because they’re buying loans where they may not be able to recoup their investment.
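To make the imbalance concrete, here is a small sketch that uses the counts shown above to work out the accuracy of a model that always predicts False. The variable names are illustrative and not part of the project code.

```python
# Counts taken from the foreclosure_status output shown above.
not_foreclosed = 4635982
foreclosed = 1585
total = not_foreclosed + foreclosed

# A model that predicts False for every row gets every non-foreclosed loan
# right and every foreclosed loan wrong.
baseline_accuracy = not_foreclosed / total
foreclosure_share = foreclosed / total

print("Accuracy of always predicting False: {:.4%}".format(baseline_accuracy))
print("Share of loans that were foreclosed on: {:.4%}".format(foreclosure_share))
```

The accuracy comes out above 99.9%, even though this model never identifies a single foreclosure.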

We’ll define the false negative rate as the number of loans where the model predicts no foreclosure but the loan was actually foreclosed on, divided by the total number of loans that were actually foreclosed on. This is the percentage of actual foreclosures that the model “missed”. Here’s a diagram:


In the diagram above, 1 loan was predicted as not being foreclosed on, but it actually was. If we divide this by the number of loans that were actually foreclosed on, 2, we get the false negative rate, 50%. We’ll use this as our error metric, so we can evaluate our model’s performance.
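The definition can be sketched in a few lines of Python. This is a minimal illustration, not the project's own implementation: the function name false_negative_rate and the sample labels are assumptions, chosen so that two loans are actually foreclosed on and the model misses one of them.

```python
def false_negative_rate(actual, predicted):
    """Share of actual foreclosures that the model predicted as not foreclosed."""
    # Keep only the predictions for loans that really were foreclosed on.
    preds_for_foreclosed = [p for a, p in zip(actual, predicted) if a]
    if not preds_for_foreclosed:
        raise ValueError("no actual foreclosures, so the rate is undefined")
    # A miss is an actual foreclosure that was predicted as False.
    missed = sum(1 for p in preds_for_foreclosed if not p)
    return missed / len(preds_for_foreclosed)

# Hypothetical labels: two actual foreclosures, one of which the model misses.
actual = [True, True, False, False]
predicted = [True, False, False, True]
print(false_negative_rate(actual, predicted))  # 0.5
```

Note that the false positive in the last row does not affect the result, since the metric only looks at loans that were actually foreclosed on.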

Setting up the classifier for machine learning

We’ll use cross validation to make predictions. With cross validation, we’ll divide our data into 3 groups. Then we’ll do the following:

  • Train a model on groups 1 and 2, and use the model to make predictions for group 3.
  • Train a model on groups 1 and 3, and use the model to make predictions for group 2.
  • Train a model on groups 2 and 3, and use the model to make predictions for group 1.

Splitting the data into groups this way means that we never train a model on the same data we’re making predictions for. This keeps overfitting from flattering our results: if we evaluated a model on data it had already seen, we’d get a falsely low false negative rate, which makes it hard to improve our algorithm or use it in the real world.
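The three train-and-predict steps can be sketched in plain Python. The helper name group_splits is hypothetical, and rows are assigned to groups round-robin purely for illustration; a real split would usually shuffle or stratify the rows first.

```python
def group_splits(n_rows, n_groups=3):
    """Yield (train_indices, predict_indices) pairs, one per held-out group."""
    # Assign row indices to groups round-robin: 0, 1, 2, 0, 1, 2, ...
    groups = [list(range(g, n_rows, n_groups)) for g in range(n_groups)]
    for held_out in range(n_groups):
        # Train on every group except the one we are about to predict.
        train = [i for g, rows in enumerate(groups) if g != held_out for i in rows]
        yield train, groups[held_out]

for train_idx, predict_idx in group_splits(9):
    print("train on", train_idx, "-> predict", predict_idx)
```

Every row ends up with exactly one prediction, and that prediction always comes from a model that never saw the row during training.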

Scikit-learn has a function called cross_val_predict which will make it easy to perform cross validation.

We’ll also need to pick an algorithm to use to make predictions. We need a classifier that can do binary classification, because the target variable, foreclosure_status, only has two values, True and False.

We’ll use logistic regression, because it works well for binary classification, runs extremely quickly, and uses little memory. This is due to how the algorithm works – instead of constructing dozens of trees, like a random forest, or doing expensive transformations, like a support vector machine, logistic regression has far fewer steps involving fewer matrix operations.

We can use the logistic regression classifier that’s implemented in scikit-learn. The only thing we need to pay attention to is the weight of each class. If we weight the classes equally, the algorithm will tend to predict False for every row, because it is trying to minimize errors. However, we care much more about foreclosures than we do about loans that aren’t foreclosed on. Thus, we’ll pass balanced to the class_weight keyword argument of the LogisticRegression class, so that the algorithm weights the foreclosures more heavily to account for the difference in the counts of each class. This keeps the algorithm from simply predicting False for every row, and instead penalizes it equally for making errors in predicting either class.
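As a rough sketch of how these pieces fit together, the example below runs a class-weighted logistic regression through 3-fold cross validation. It makes several assumptions: the data is synthetic and generated with make_classification rather than read from train.csv, and the imports assume a scikit-learn version that provides the sklearn.model_selection module (0.18 or later).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced stand-in data (roughly 5% positives); not the loan data.
X, y = make_classification(n_samples=600, n_features=5,
                           weights=[0.95, 0.05], random_state=0)

# "balanced" weights each class by n_samples / (n_classes * class_count),
# so errors on the rare class count for more during training.
counts = np.bincount(y)
balanced_weights = len(y) / (2 * counts)
print("class weights:", balanced_weights)

model = LogisticRegression(class_weight="balanced", max_iter=1000)

# cv=3 trains on two groups and predicts the third, for each of the three groups.
predictions = cross_val_predict(model, X, y, cv=3)
print("predicted positives:", int(predictions.sum()), "of", len(predictions))
```

The returned array holds one out-of-fold prediction per row, so it can be compared directly against the true labels with an error metric such as the false negative rate.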