
ChatGPT as a Python Programming Assistant

Is ChatGPT useful for Python programmers, specifically those of us who use Python for data processing, data cleaning, and building machine learning models? Let's give it a try and find out.



Original image by Sigmund on Unsplash | Edits by author

 

ChatGPT from OpenAI is... you already know. You've read enough about it, and introductions are no longer needed. If you happen to not know what ChatGPT is, have a look at this first, and then come on back to continue.

Conversations with ChatGPT can be entertaining, even enlightening. But let's have a look at how helpful it can actually be.

Let's see what ChatGPT can do when it comes to writing some code from specifications that we provide. We will start simple and build from there.

 

Simple Data Processing Script

 
I figured that a good place to start was a data processing script of some sort. Since I intend to have ChatGPT write some machine learning scripts afterwards, using a variety of Python libraries, a straightforward preprocessing task seemed like a reasonable warm-up.

 

The Goal

 
First, I wanted a script that loads a dataset; ensures all dataset values are numeric, converting any string values as needed; checks for missing values; and splits the dataset into train and test sets.

 

The Prompt

 
Write a Python script that does the following:
- loads a dataset from csv on disk to Pandas DataFrame
- ensures that dataset is entirely numeric; converts string values to numeric and tracks this conversion in a dictionary to reference later
- checks for missing values in the DataFrame and imputes missing values for any record that has one missing value; entirely drops a record that has more than one missing value
- splits the dataset into test and training sets

 

The Response

 

import pandas as pd

# Load dataset from csv to DataFrame
df = pd.read_csv('dataset.csv')

# Initialize dictionary to track string to numeric conversions
conversions = {}

# Convert string values to numeric and track conversions in dictionary
for col in df.columns:
    if df[col].dtype == object:
        conversions[col] = {val: i for i, val in enumerate(df[col].unique())}
        df[col] = df[col].map(conversions[col])

# Check for missing values
missing_values = df.isna().sum()

# Impute missing values for records with one missing value
for col in missing_values[missing_values == 1].index:
    df[col].fillna(df[col].median(), inplace=True)

# Drop records with more than one missing value
df.dropna(thresh=len(df.columns) - 1, inplace=True)

# Split dataset into test and training sets
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2)

 

Note: the script loads the dataset from csv file name "dataset.csv" and splits the dataframe into test and training sets with 80/20 ratio. If you want to use different file name or ratio, change the path/ratio accordingly.

 

The Analysis

 
ChatGPT produced a useful script. It decided on a dataset name, as well as a train/test split percentage, and noted both of these points and how to change them if desired. Overall, mission accomplished.
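
If you want to try the generated script yourself, a few lines like the following will produce a dataset.csv for it to chew on. This is my own minimal sketch, not part of ChatGPT's output, and the column names and values are made up purely for illustration.

import numpy as np
import pandas as pd

# Fabricate a tiny dataset with a string column and a couple of missing values
df = pd.DataFrame({
    'age': [25, 32, np.nan, 41, 29, 37],
    'income': [50000, 64000, 58000, np.nan, 61000, 72000],
    'city': ['Boston', 'Denver', 'Boston', 'Austin', 'Denver', 'Austin'],
    'label': [0, 1, 0, 1, 1, 0],
})

# Write it out under the file name the generated script expects
df.to_csv('dataset.csv', index=False)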

 

Refactoring Our Code

 

The Goal

 
Now that we have a script that does something, let's refactor it a little. A cornerstone of good coding is the use of functions to logically organize your code, facilitate reuse, and ensure you don't repeat yourself.

 

The Prompt

 
Rewrite the above script to use a series of functions that make sense

 

The Response

 

import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_convert_data(file_path):
    # Load dataset from csv to DataFrame
    df = pd.read_csv(file_path)
    
    # Initialize dictionary to track string to numeric conversions
    conversions = {}
    
    # Convert string values to numeric and track conversions in dictionary
    for col in df.columns:
        if df[col].dtype == object:
            conversions[col] = {val: i for i, val in enumerate(df[col].unique())}
            df[col] = df[col].map(conversions[col])
    
    return df, conversions

def handle_missing_values(df):
    # Check for missing values
    missing_values = df.isna().sum()
    
    # Impute missing values for records with one missing value
    for col in missing_values[missing_values == 1].index:
        df[col].fillna(df[col].median(), inplace=True)
    
    # Drop records with more than one missing value
    df.dropna(thresh=len(df.columns) - 1, inplace=True)
    
    return df

def split_data(df, test_size):
    return train_test_split(df, test_size=test_size)

# Use the functions
file_path = "dataset.csv"
df, conversions = load_and_convert_data(file_path)
df = handle_missing_values(df)
train_df, test_df = split_data(df, 0.2)

 

Here, I have used three functions:

1. load_and_convert_data(file_path) loads the csv dataset and convert the string values to numeric.
2. handle_missing_values(df) handles the missing values in the dataset.
3. split_data(df, test_size) splits the dataset into test and training sets.

The main script uses these functions in order to load, convert and clean the dataset, then split it into test and train sets.

The path of the dataset and the test ratio are inputs for these functions.

 

The Analysis

 
If you look at the first script ChatGPT output, you can probably see some rather obvious lines along which the functionality could be separated into functions. I certainly did, and in this regard ChatGPT did exactly what I expected it to. Again, a competent script has been generated.
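
As an aside, the conversions dictionary the prompt asked for is easy to overlook once the code is tucked into functions. Continuing from the refactored script above, here is a quick sketch (my own addition, not ChatGPT's) of how you might use it later to map the numeric codes back to the original strings; the 'city' column here is hypothetical.

df, conversions = load_and_convert_data("dataset.csv")

# conversions maps each converted column to {original_string: numeric_code};
# invert each mapping so numeric codes can be translated back to strings
inverse_conversions = {
    col: {code: val for val, code in mapping.items()}
    for col, mapping in conversions.items()
}

# Recover the original string values for a hypothetical converted 'city' column
restored_city = df['city'].map(inverse_conversions['city'])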

 

Extending Our Code

 

The Goal

 
OK, our script looks a little better now that it's using some functions for organization. Next, let's extend it slightly so that the user can specify a few parameters when running the script.

 

The Prompt

 
Rewrite the above program to solicit the following from the command line in the form of arguments, and use them in the program:
- the path/name of the dataset csv file
- the train/test split size
- the threshold for the number of missing values per record, after which to drop the entire record instead of imputing the missing values

 

The Response

 

import pandas as pd
import argparse
from sklearn.model_selection import train_test_split

def load_and_convert_data(file_path):
    # Load dataset from csv to DataFrame
    df = pd.read_csv(file_path)
    
    # Initialize dictionary to track string to numeric conversions
    conversions = {}
    
    # Convert string values to numeric and track conversions in dictionary
    for col in df.columns:
        if df[col].dtype == object:
            conversions[col] = {val: i for i, val in enumerate(df[col].unique())}
            df[col] = df[col].map(conversions[col])
    
    return df, conversions

def handle_missing_values(df, threshold):
    # Check for missing values
    missing_values = df.isna().sum()
    
    # Impute missing values for records with one missing value
    for col in missing_values[missing_values == 1].index:
        df[col].fillna(df[col].median(), inplace=True)
    
    # Drop records with more than threshold missing value
    df.dropna(thresh=len(df.columns) - threshold, inplace=True)
    
    return df

def split_data(df, test_size):
    return train_test_split(df, test_size=test_size)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("file_path", help="path/name of the dataset csv file")
    parser.add_argument("test_size", help="train/test split size", type=float)
    parser.add_argument("threshold", help="threshold for the number of missing values per record", type=int)
    args = parser.parse_args()
    
    # Use the functions
    df, conversions = load_and_convert_data(args.file_path)
    df = handle_missing_values(df, args.threshold)
    train_df, test_df = split_data(df, args.test_size)

 

In this version, I have used the argparse module to take the inputs of file path, test/train split size, and threshold for the number of missing values per record as command-line arguments. The user will have to run the script with the correct arguments, for example:

 
python script.py dataset.csv 0.2 1
 

This will run the script using the "dataset.csv" file, a test_size of 0.2 and a threshold of 1 for the missing values.

The main script uses these arguments in the functions to load, convert and clean the dataset, then split it into test and train sets.

It's also worth mentioning that, when running script, the script should be run from command line, rather than running it from an IDE.

 

The Analysis

 
Honestly, not much analysis is needed here, given that ChatGPT explained what it did quite well. I'll actually note something here that I hadn't previously, and give ChatGPT a pat on the back for its commenting, which is expressive and concise.
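
If I were touching the result up by hand, the one tweak I might make is turning the positional arguments into optional flags with defaults, so the script still runs with no arguments at all. Here is a sketch of that change, assuming the same three parameters; this is my edit, not ChatGPT's, and the default values are arbitrary.

import argparse

# Same arguments as before, but optional and with defaults
parser = argparse.ArgumentParser(description="Preprocess a CSV dataset")
parser.add_argument("--file-path", default="dataset.csv",
                    help="path/name of the dataset csv file")
parser.add_argument("--test-size", type=float, default=0.2,
                    help="train/test split size")
parser.add_argument("--threshold", type=int, default=1,
                    help="max missing values per record before the record is dropped")
args = parser.parse_args()

# argparse converts the dashes to underscores: args.file_path, args.test_size, args.threshold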

 

Converting Our Code to a Streamlit App

 

The Goal

 
Now let's say we wanted to make this code "production" ready (humor me). Let's get ChatGPT to wrap these functions in a Streamlit app.

 

The Prompt

 
Rewrite the above program as a Streamlit app, allowing for the user to provide the same arguments as in the command line args above

 

The Response

 

import pandas as pd
import streamlit as st
from sklearn.model_selection import train_test_split

def load_and_convert_data(file_path):
    # Load dataset from csv to DataFrame
    df = pd.read_csv(file_path)
    
    # Initialize dictionary to track string to numeric conversions
    conversions = {}
    
    # Convert string values to numeric and track conversions in dictionary
    for col in df.columns:
        if df[col].dtype == object:
            conversions[col] = {val: i for i, val in enumerate(df[col].unique())}
            df[col] = df[col].map(conversions[col])
    
    return df, conversions

def handle_missing_values(df, threshold):
    # Check for missing values
    missing_values = df.isna().sum()
    
    # Impute missing values for records with one missing value
    for col in missing_values[missing_values == 1].index:
        df[col].fillna(df[col].median(), inplace=True)
    
    # Drop records with more than threshold missing value
    df.dropna(thresh=len(df.columns) - threshold, inplace=True)
    
    return df

def split_data(df, test_size):
    return train_test_split(df, test_size=test_size)

def main():
    st.set_page_config(page_title="Data Preprocessing", page_icon=":guardsman:", layout="wide")
    st.title("Data Preprocessing")
    
    file_path = st.text_input("Enter the path/name of the dataset csv file: ")
    test_size = st.number_input("Enter the train/test split size (decimal between 0 and 1): ", step=0.01, value=0.2)
    threshold = st.number_input("Enter the threshold for the number of missing values per record: ", step=1, value=1)
    
    if st.button("Process Data"):
        df, conversions = load_and_convert_data(file_path)
        df = handle_missing_values(df, threshold)
        train_df, test_df = split_data(df, test_size)
        st.success("Data preprocessing completed!")

if __name__ == '__main__':
    main()

 

This version is a Streamlit app that allows the user to provide the same arguments as command-line arguments. It uses the Streamlit library to create an interactive web app that allows the user to input the file path, test/train split size, and threshold for the number of missing values per record. The user can then click the "Process Data" button to run the script and preprocess the data. It uses the same functions as the previous version of the script to load, convert, and clean the dataset, then split it into test and train sets.

To run the script, you need to have Streamlit installed, you can run the command pip install streamlit in your terminal. Once installed, run the script with the command streamlit run script.py.

 

The Analysis

 
It did what it was supposed to do. ChatGPT took our functioning... functions and built a simple Streamlit web app with them.

 

The Streamlit web app created with the script generated by ChatGPT
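
Before I would actually call this "production" ready, there is one substitution I would make by hand: a deployed Streamlit app usually can't read an arbitrary local file path typed into a text box, so a file uploader is the more idiomatic input. A minimal sketch of that swap, again my own suggestion rather than part of ChatGPT's response:

import pandas as pd
import streamlit as st

# Let the user upload the CSV instead of typing a local file path
uploaded_file = st.file_uploader("Upload a CSV dataset", type="csv")

if uploaded_file is not None:
    # pd.read_csv accepts the uploaded file object directly
    df = pd.read_csv(uploaded_file)
    st.write(df.head())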

 

Summary

 
ChatGPT did a perfectly competent job of creating, refactoring, extending, and adding a wrapper around a simple data preprocessing Python script we wanted it to write for us, from the specifications that we provided. While it wasn't the most complex of programs by a long shot, it was a step in the direction of having ChatGPT be genuinely useful by acting as a programming assistant. The results hit the mark at each step.

Next time, we will build a less trivial program with some increased complexity and additional requirements. I'm excited to see how well ChatGPT performs under pressure. Hopefully you are too.

 
 
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.