KDnuggets Home » News » 2018 » Aug » Tutorials, Overviews » Programming Best Practices For Data Science ( 18:n30 )

Programming Best Practices For Data Science


In this post, I'll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.



By Srini Kadamati, Dataquest.io

The data science life cycle is generally comprised of the following components:

  • data retrieval
  • data cleaning
  • data exploration and visualization
  • statistical or predictive modeling

While these components are helpful for understanding the different phases, they don't help us think about our programming workflow.

Often, the entire data science life cycle ends up as an arbitrary mess of notebook cells in eithera Jupyter Notebook or a single messy script. In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical / predictive modeling.

But there's a better way! In this post, I'll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.

Prototype mindset prioritizes: Production mindset prioritizes:
iteration speed on small pieces of code iteration speed on the full pipeline
less abstraction (directly modifying code and data objects) more abstraction (modifying parameter values instead)
less structure to the code (less modularity) more structure to the code (more modularity)
helping you and others understand the code and data helping a computer run code automatically

 

I personally use JupyterLab for the entire process (both prototyping and productionizing). I recommend using JupyterLab at least for prototyping.

 

Lending Club data

 
To help more concretely understand the difference between the prototyping and the production mindset, let's work with some real data. We'll work with lending data from the peer-to-peer lending site, Lending Club. Unlike a bank, Lending Club doesn't lend money itself. Lending Club is instead a marketplace for lenders to lend money to individuals who are seeking loans for a variety of reasons (home repairs, wedding costs, etc.). We can use this data to build models that will predict if a given loan application will be successful or not. We won't dive into building a machine learning pipeline for making predictions in this post, but we cover it in our Machine Learning Project Walkthrough Course.

Lending Club offers detailed, historical data on both completed loans (loan applications that were approved by Lending Club and they found lenders) and declined loans (loan application was declined by Lending Club and money never changed hands). Navigate to their data download page and select 2007-2011 under DOWLNOAD LOAN DATA.

lendingclub

 

Prototype mindset

 
In the prototype mindset, we're interested in quickly iterating and trying to understand some properties and truths about the data. Create a new Jupyter notebook and add a Markdown cell that explains:

  • Any research you did on Lending Club to better understand the platform
  • Any information on the data set you downloaded

First things first, let's read the CSV file into pandas.

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv')
loans_2007.head(2)


We get two pieces of output, first a warning.

/home/srinify/anaconda3/envs/dq2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (0,1,2,3,4,7,13,18,24,25,27,28,29,30,31,32,34,36,37,38,39,40,41,42,43,44,46,47,49,50,51,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,123,124,125,126,127,128,129,130,131,132,133,134,135,136,142,143,144) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)


Then the first 5 rows of the dataframe, which we'll avoid showing here (because it's quite long).

We also got the following dataframe output:

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
d member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan url desc purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths mths_since_last_delinq mths_since_last_record open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d collections_12_mths_ex_med mths_since_last_major_derog policy_code application_type annual_inc_joint dti_joint verification_status_joint acc_now_delinq tot_coll_amt tot_cur_bal open_acc_6m open_act_il open_il_12m open_il_24m mths_since_rcnt_il total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_il_acct mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_bc_dlq mths_since_recent_inq mths_since_recent_revol_delinq num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_120dpd_2m num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit revol_bal_joint sec_app_earliest_cr_line sec_app_inq_last_6mths sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_act_il sec_app_num_rev_accts sec_app_chargeoff_within_12_mths sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog hardship_flag hardship_type hardship_reason hardship_status deferral_term hardship_amount hardship_start_date hardship_end_date payment_plan_start_date hardship_length hardship_dpd hardship_loan_status orig_projected_additional_accrued_interest hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term


The warning lets us know that the type inferencing of pandas for each column would be improved if we set the low_memory parameter to False when calling pandas.read_csv().

The second output is more problematic because the way the DataFrame is storing the data has issues. JupyterLab has a terminal environment built in, so we can open it and use the bash command head to observe the first two lines of the raw file:

head -2 LoanStats3a.csv


While the second line contains the column names as we expect in a CSV file, it looks like the first line is throwing off the formatting of the DataFrame when pandas tries to parse the file:

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)


Add a Markdown cell that details your observations and add a code cell that factors in the observations.

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1, low_memory=False)


Read the data dictionary from the Lending Club download page to understand which columns don't contain useful information for features. The desc and url columns seem to immediately fit this criteria.

loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)


The next step is to drop any columns with more than 50% missing rows. Use one cell to explore which columns meet that criteria, and another to actually drop the columns.

loans_2007.isnull().sum()/len(loans_2007)


loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)


Because we're using a Jupyter notebook to track our thoughts and our code, we're relying on the environment (via the IPython kernel) to keep track of changes of the state. This frees us up to be freeform, move cells around, run the same code multiple times, etc.

In general, code in the prototyping mindset should focus on:

  • Understandability
    • Markdown cells to describe our observations and assumptions
    • Small pieces of code for the actual logic
    • Lots of visualizations and counts
  • Minimal abstractions
    • Most code shouldn't be in functions (should feel more object-oriented)

Let's say we spent another hour exploring the data and writing markdown cells that describe the data cleaning we did. We can then switch over to the production mindset and make the code more robust.

 

Production mindset

 
In the production mindset, we want to focus on writing code that will generalize to more situations. In our case, we want our data cleaning code to work for any of the data sets from Lending Club (from other time periods). The best way to generalize our code is to turn it into a data pipeline. A data pipeline is designed using principles from functional programming, where data is modified within functions and then passed between functions.

Here's a first iteration of this pipeline using a single function to encapsulate data cleaning code:

import pandas as pd

def import_clean(file_list):
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        loans = loans.drop(['desc', 'url'], axis=1)
        half_count = len(loans)/2
        loans = loans.dropna(thresh=half_count, axis=1)
        loans = loans.drop_duplicates()
        # Drop first group of features
        loans = loans.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
        # Drop second group of features
        loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
        # Drop third group of features
        loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
        frames.append(loans)
    return frames
    
frames = import_clean(['LoanStats3a.csv'])


In the code above, we abstracted the code from earlier into a single function. The input to this function is a list of filenames and the output is a list of DataFrame objects.

In general, the production mindset should focus on:

  • Healthy abstractions
    • Code should generalize to be compatible with similar data sources
    • Code shouldn't be so general that it becomes cumbersome to understand
  • pipeline stability
    • Reliability should match how frequently its run (daily? weekly? monthly?)

 

Switching between mindsets

 
Let's say we tried to run the function for all of the data sets from Lending Club and Python returned errors. Some potential sources for errors:

  • Variance in column names in some of the files
  • Variance in columns being dropped because of the 50% missing value threshold
  • Different column types based on pandas type inference for that file

In those cases, we should actually switch back to our prototype notebook and investigate further. When we've determined that we want our pipeline to be more flexible and account for specific variations in the data, we can re-incorporate that back into the pipeline logic.

Here's an example where we adapted the function to accommodate for a different drop threshold value:

import pandas as pd

def import_clean(file_list, threshold=0.5):
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        loans = loans.drop(['desc', 'url'], axis=1)
        threshold_count = len(loans)*threshold
        loans = loans.dropna(thresh=half_count, axis=1)
        loans = loans.drop_duplicates()
        # Drop first group of features
        loans = loans.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
        # Drop second group of features
        loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
        # Drop third group of features
        loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
        frames.append(loans)
    return frames
    
frames = import_clean(['LoanStats3a.csv'], threshold=0.7)


The default value is still 0.5, but we can over-ride it to 0.7 if we want.

Here are a few ways to make the pipeline more flexible, in decreasing priority:

  • Use optional, positional, and required arguments
  • Use if / then statements along with Boolean input values within the functions
  • Use new data structures (dictionaries, lists, etc.) to represent custom actions for specific datasets

This pipeline can scale to all phases of the data science workflow. Here's some skeleton code that previews how this looks.

import pandas as pd

def import_clean(file_list, threshold=0.5):
    ## Code
    
def visualize(df_list):
    # Find the most important features and generate pairwise scatter plots
    # Display visualizations and write to file.
    plt.savefig("scatter_plots.png")

def combine(df_list):
    # Combine dataframes and generate train and test sets
    # Drop features all dataframes don't share
    # Return both train and test dataframes
    return train,test
    
def train(train_df):
    # Train model
    return model
    
def validate(train_df, test-df):
    # K-fold cross validation
    # Return metrics dictionary
    return metrics_dict
    
frames = import_clean(['LoanStats3a.csv', 'LoanStats2012.csv'], threshold=0.7)
visualize(frames)
train_df, test_df = combine(frames)
model = train(train_df)
metrics = test(train_df, test_df)
print(metrics)


 

Next steps

 
If you're interested in deepening your understanding and practicing further, I recommend the following next steps:

 
Bio: Srini Kadamati is Director of Content at Dataquest.io.

Original. Reposted with permission.

Related:


Sign Up