Programming Best Practices For Data Science

In this post, I'll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.

comments

By Srini Kadamati, Dataquest.io

The data science life cycle is generally comprised of the following components:

data retrieval
data cleaning
data exploration and visualization
statistical or predictive modeling

While these components are helpful for understanding the different phases, they don't help us think about our programming workflow.

Often, the entire data science life cycle ends up as an arbitrary mess of notebook cells in eithera Jupyter Notebook or a single messy script. In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical / predictive modeling.

But there's a better way! In this post, I'll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.

Prototype mindset prioritizes:	Production mindset prioritizes:
iteration speed on small pieces of code	iteration speed on the full pipeline
less abstraction (directly modifying code and data objects)	more abstraction (modifying parameter values instead)
less structure to the code (less modularity)	more structure to the code (more modularity)
helping you and others understand the code and data	helping a computer run code automatically

I personally use JupyterLab for the entire process (both prototyping and productionizing). I recommend using JupyterLab at least for prototyping.

Lending Club data

To help more concretely understand the difference between the prototyping and the production mindset, let's work with some real data. We'll work with lending data from the peer-to-peer lending site, Lending Club. Unlike a bank, Lending Club doesn't lend money itself. Lending Club is instead a marketplace for lenders to lend money to individuals who are seeking loans for a variety of reasons (home repairs, wedding costs, etc.). We can use this data to build models that will predict if a given loan application will be successful or not. We won't dive into building a machine learning pipeline for making predictions in this post, but we cover it in our Machine Learning Project Walkthrough Course.

Lending Club offers detailed, historical data on both completed loans (loan applications that were approved by Lending Club and they found lenders) and declined loans (loan application was declined by Lending Club and money never changed hands). Navigate to their data download page and select 2007-2011 under DOWLNOAD LOAN DATA.

Prototype mindset

In the prototype mindset, we're interested in quickly iterating and trying to understand some properties and truths about the data. Create a new Jupyter notebook and add a Markdown cell that explains:

Any research you did on Lending Club to better understand the platform
Any information on the data set you downloaded

First things first, let's read the CSV file into pandas.

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv')
loans_2007.head(2)

We get two pieces of output, first a warning.

/home/srinify/anaconda3/envs/dq2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (0,1,2,3,4,7,13,18,24,25,27,28,29,30,31,32,34,36,37,38,39,40,41,42,43,44,46,47,49,50,51,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,123,124,125,126,127,128,129,130,131,132,133,134,135,136,142,143,144) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Then the first 5 rows of the dataframe, which we'll avoid showing here (because it's quite long).

We also got the following dataframe output:

																																																																																																																																																Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
d	member_id	loan_amnt	funded_amnt	funded_amnt_inv	term	int_rate	installment	grade	sub_grade	emp_title	emp_length	home_ownership	annual_inc	verification_status	issue_d	loan_status	pymnt_plan	url	desc	purpose	title	zip_code	addr_state	dti	delinq_2yrs	earliest_cr_line	inq_last_6mths	mths_since_last_delinq	mths_since_last_record	open_acc	pub_rec	revol_bal	revol_util	total_acc	initial_list_status	out_prncp	out_prncp_inv	total_pymnt	total_pymnt_inv	total_rec_prncp	total_rec_int	total_rec_late_fee	recoveries	collection_recovery_fee	last_pymnt_d	last_pymnt_amnt	next_pymnt_d	last_credit_pull_d	collections_12_mths_ex_med	mths_since_last_major_derog	policy_code	application_type	annual_inc_joint	dti_joint	verification_status_joint	acc_now_delinq	tot_coll_amt	tot_cur_bal	open_acc_6m	open_act_il	open_il_12m	open_il_24m	mths_since_rcnt_il	total_bal_il	il_util	open_rv_12m	open_rv_24m	max_bal_bc	all_util	total_rev_hi_lim	inq_fi	total_cu_tl	inq_last_12m	acc_open_past_24mths	avg_cur_bal	bc_open_to_buy	bc_util	chargeoff_within_12_mths	delinq_amnt	mo_sin_old_il_acct	mo_sin_old_rev_tl_op	mo_sin_rcnt_rev_tl_op	mo_sin_rcnt_tl	mort_acc	mths_since_recent_bc	mths_since_recent_bc_dlq	mths_since_recent_inq	mths_since_recent_revol_delinq	num_accts_ever_120_pd	num_actv_bc_tl	num_actv_rev_tl	num_bc_sats	num_bc_tl	num_il_tl	num_op_rev_tl	num_rev_accts	num_rev_tl_bal_gt_0	num_sats	num_tl_120dpd_2m	num_tl_30dpd	num_tl_90g_dpd_24m	num_tl_op_past_12m	pct_tl_nvr_dlq	percent_bc_gt_75	pub_rec_bankruptcies	tax_liens	tot_hi_cred_lim	total_bal_ex_mort	total_bc_limit	total_il_high_credit_limit	revol_bal_joint	sec_app_earliest_cr_line	sec_app_inq_last_6mths	sec_app_mort_acc	sec_app_open_acc	sec_app_revol_util	sec_app_open_act_il	sec_app_num_rev_accts	sec_app_chargeoff_within_12_mths	sec_app_collections_12_mths_ex_med	sec_app_mths_since_last_major_derog	hardship_flag	hardship_type	hardship_reason	hardship_status	deferral_term	hardship_amount	hardship_start_date	hardship_end_date	payment_plan_start_date	hardship_length	hardship_dpd	hardship_loan_status	orig_projected_additional_accrued_interest	hardship_payoff_balance_amount	hardship_last_payment_amount	disbursement_method	debt_settlement_flag	debt_settlement_flag_date	settlement_status	settlement_date	settlement_amount	settlement_percentage	settlement_term

The warning lets us know that the type inferencing of pandas for each column would be improved if we set the low_memory parameter to False when calling pandas.read_csv().

The second output is more problematic because the way the DataFrame is storing the data has issues. JupyterLab has a terminal environment built in, so we can open it and use the bash command head to observe the first two lines of the raw file:

head -2 LoanStats3a.csv

While the second line contains the column names as we expect in a CSV file, it looks like the first line is throwing off the formatting of the DataFrame when pandas tries to parse the file:

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)

Add a Markdown cell that details your observations and add a code cell that factors in the observations.

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1, low_memory=False)

Read the data dictionary from the Lending Club download page to understand which columns don't contain useful information for features. The desc and url columns seem to immediately fit this criteria.

loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)

The next step is to drop any columns with more than 50% missing rows. Use one cell to explore which columns meet that criteria, and another to actually drop the columns.

loans_2007.isnull().sum()/len(loans_2007)

loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)

Because we're using a Jupyter notebook to track our thoughts and our code, we're relying on the environment (via the IPython kernel) to keep track of changes of the state. This frees us up to be freeform, move cells around, run the same code multiple times, etc.

In general, code in the prototyping mindset should focus on:

Understandability
- Markdown cells to describe our observations and assumptions
- Small pieces of code for the actual logic
- Lots of visualizations and counts
Minimal abstractions
- Most code shouldn't be in functions (should feel more object-oriented)

Let's say we spent another hour exploring the data and writing markdown cells that describe the data cleaning we did. We can then switch over to the production mindset and make the code more robust.

Production mindset

In the production mindset, we want to focus on writing code that will generalize to more situations. In our case, we want our data cleaning code to work for any of the data sets from Lending Club (from other time periods). The best way to generalize our code is to turn it into a data pipeline. A data pipeline is designed using principles from functional programming, where data is modified within functions and then passed between functions.

Here's a first iteration of this pipeline using a single function to encapsulate data cleaning code:

import pandas as pd

def import_clean(file_list):
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        loans = loans.drop(['desc', 'url'], axis=1)
        half_count = len(loans)/2
        loans = loans.dropna(thresh=half_count, axis=1)
        loans = loans.drop_duplicates()
        # Drop first group of features
        loans = loans.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
        # Drop second group of features
        loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
        # Drop third group of features
        loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
        frames.append(loans)
    return frames
    
frames = import_clean(['LoanStats3a.csv'])

In the code above, we abstracted the code from earlier into a single function. The input to this function is a list of filenames and the output is a list of DataFrame objects.

In general, the production mindset should focus on:

Healthy abstractions
- Code should generalize to be compatible with similar data sources
- Code shouldn't be so general that it becomes cumbersome to understand
pipeline stability
- Reliability should match how frequently its run (daily? weekly? monthly?)

Switching between mindsets

Let's say we tried to run the function for all of the data sets from Lending Club and Python returned errors. Some potential sources for errors:

Variance in column names in some of the files
Variance in columns being dropped because of the 50% missing value threshold
Different column types based on pandas type inference for that file

In those cases, we should actually switch back to our prototype notebook and investigate further. When we've determined that we want our pipeline to be more flexible and account for specific variations in the data, we can re-incorporate that back into the pipeline logic.

Here's an example where we adapted the function to accommodate for a different drop threshold value:

import pandas as pd

def import_clean(file_list, threshold=0.5):
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        loans = loans.drop(['desc', 'url'], axis=1)
        threshold_count = len(loans)*threshold
        loans = loans.dropna(thresh=half_count, axis=1)
        loans = loans.drop_duplicates()
        # Drop first group of features
        loans = loans.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
        # Drop second group of features
        loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
        # Drop third group of features
        loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
        frames.append(loans)
    return frames
    
frames = import_clean(['LoanStats3a.csv'], threshold=0.7)

The default value is still 0.5, but we can over-ride it to 0.7 if we want.

Here are a few ways to make the pipeline more flexible, in decreasing priority:

Use optional, positional, and required arguments
Use if / then statements along with Boolean input values within the functions
Use new data structures (dictionaries, lists, etc.) to represent custom actions for specific datasets

This pipeline can scale to all phases of the data science workflow. Here's some skeleton code that previews how this looks.

import pandas as pd

def import_clean(file_list, threshold=0.5):
    ## Code
    
def visualize(df_list):
    # Find the most important features and generate pairwise scatter plots
    # Display visualizations and write to file.
    plt.savefig("scatter_plots.png")

def combine(df_list):
    # Combine dataframes and generate train and test sets
    # Drop features all dataframes don't share
    # Return both train and test dataframes
    return train,test
    
def train(train_df):
    # Train model
    return model
    
def validate(train_df, test-df):
    # K-fold cross validation
    # Return metrics dictionary
    return metrics_dict
    
frames = import_clean(['LoanStats3a.csv', 'LoanStats2012.csv'], threshold=0.7)
visualize(frames)
train_df, test_df = combine(frames)
model = train(train_df)
metrics = test(train_df, test_df)
print(metrics)

Next steps

If you're interested in deepening your understanding and practicing further, I recommend the following next steps:

Learn how to turn your pipeline into a standalone script that can be run as a module or from the command line: https://docs.python.org/3/library/main.html
Learn how to use Luigi to build more complex pipelines that can be run in the cloud: Building Data Pipelines in Python and Luigi
Learn more about data engineering: Data Engineering Posts on Dataquest

Bio: Srini Kadamati is Director of Content at Dataquest.io.

Original. Reposted with permission.

Related:

Programming Best Practices For Data Science

Lending Club data

Prototype mindset

Production mindset

Switching between mindsets

Next steps

More On This Topic

Latest Posts

Top Posts

<img width="94" height="95" src="/images/tkb-1808-s.png" width=94 alt="Silver Blog" align="right">Programming Best Practices For Data Science

Lending Club data

Prototype mindset

Production mindset

Switching between mindsets

Next steps

More On This Topic

Latest Posts

Top Posts

Programming Best Practices For Data Science