Model Selection and Experimentation Automation with LLMs

Automate an important step of machine learning modelling with LLMs.



Image by Editor | Ideogram

 

Large language models, or LLMs, have become tools that facilitate our work in many ways, from answering questions to generating task lists. Both individuals and businesses rely on them in their day-to-day work.

Code generation and evaluation have recently become prominent features that many commercial products offer to help developers work with their code. LLMs can also be extended to data science work, especially for model selection and experimentation.

This article will explore how to automate model selection and experimentation with LLMs. You can always adapt the structure presented here, but it demonstrates what is possible.

Let’s get into it.

 

Model Selection and Experimentation Automation with LLMs

 
We will set up the dataset we will use for model training and the code for the automation. We will use the Credit Card Fraud dataset from Kaggle for this example. Here is what I do to prepare the data before the preprocessing step.

import pandas as pd

# Load the Kaggle dataset and drop the columns we will not use
df = pd.read_csv('fraud_data.csv')
df = df.drop(['trans_date_trans_time', 'merchant', 'dob', 'trans_num', 'merch_lat', 'merch_long'], axis=1)

# Remove rows with missing values and save the cleaned dataset
df = df.dropna().reset_index(drop=True)
df.to_csv('fraud_data.csv', index=False)

 

We will use only a subset of the dataset's columns and drop all rows with missing data. This is not the optimal preprocessing, but our focus here is on model selection and experimentation.

Next, we will prepare a folder for our project and place all the related files there. First, we will create the requirements.txt file for the environment. You can fill it with the packages below.

openai
pandas
scikit-learn
pyyaml
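
With the requirements.txt file in place, you can install the dependencies, for example with pip:

pip install -r requirements.txt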

 

Next, we will use a YAML file for all the related metadata. This includes the OpenAI API key, the models to test, the evaluation metrics, and the dataset's location.

llm_api_key: "YOUR-OPENAI-API-KEY"
default_models:
  - LogisticRegression
  - DecisionTreeClassifier
  - RandomForestClassifier
metrics: ["accuracy", "precision", "recall", "f1_score"]
dataset_path: "fraud_data.csv"

 

Then, we import the packages used in the process. We will rely on scikit-learn for the modelling and on OpenAI's GPT-4 as the LLM.

import pandas as pd
import yaml
import ast
import re
import sklearn
from openai import OpenAI
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

 

Additionally, we will set up helper functions and information to support the process. The configuration loader, dataset loader, and data preprocessing are in the functions below.

model_mapping = {
    "LogisticRegression": LogisticRegression,
    "DecisionTreeClassifier": DecisionTreeClassifier,
    "RandomForestClassifier": RandomForestClassifier
}

def load_config(config_path='config.yaml'):
    with open(config_path, 'r') as file:
        config = yaml.safe_load(file)
    return config

def load_data(dataset_path):
    return pd.read_csv(dataset_path)

def preprocess_data(df):
    label_encoders = {}
    for column in df.select_dtypes(include=['object']).columns:
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        label_encoders[column] = le
    return df, label_encoders

 

In the same file, we will set up the LLM call and assign it the role of a machine learning expert. We will use the following code to initiate that.

def call_llm(prompt, api_key):
    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert in machine learning and able to evaluate the model well."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()

 

You can change the LLM model to what you want, like an open-source one from Hugging Face, but we recommend sticking with OpenAI for now.
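
If you do switch, one low-friction option is an OpenAI-compatible endpoint: many open-source model servers expose the same chat completions API, so you can often reuse the existing client by pointing it at a different base URL. This is only a sketch of the idea; the base URL and model name below are placeholders, not part of the original setup.

def call_llm_compatible(prompt, api_key, base_url, model_name):
    # Same call as call_llm, but aimed at any server that speaks the OpenAI chat API
    client = OpenAI(api_key=api_key, base_url=base_url)
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are an expert in machine learning and able to evaluate the model well."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()

# Hypothetical usage with a locally hosted model server:
# call_llm_compatible(prompt, api_key="not-needed", base_url="http://localhost:8000/v1", model_name="my-local-model")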

In the following code, I prepare a few functions to clean up and validate the LLM output. This ensures the output can be used in the follow-up steps of model selection and experimentation.

def clean_hyperparameter_suggestion(suggestion):
    pattern = r'\{.*?\}'
    match = re.search(pattern, suggestion, re.DOTALL)
    if match:
        cleaned_suggestion = match.group(0)
        return cleaned_suggestion
    else:
        print("Could not find a dictionary in the hyperparameter suggestion.")
        return None

def extract_model_name(llm_response, available_models):
    for model in available_models:
        pattern = r'\b' + re.escape(model) + r'\b'
        if re.search(pattern, llm_response, re.IGNORECASE):
            return model
    return None

def validate_hyperparameters(model_class, hyperparameters):
    valid_params = model_class().get_params()
    invalid_params = []
    for param, value in hyperparameters.items():
        if param not in valid_params:
            invalid_params.append(param)
        else:
            if param == 'max_features' and value == 'auto':
                print(f"Invalid value for parameter '{param}': '{value}'")
                invalid_params.append(param)
    if invalid_params:
        print(f"Invalid hyperparameters for {model_class.__name__}: {invalid_params}")
        return False
    return True

def correct_hyperparameters(hyperparameters, model_name):
    corrected = False
    if model_name == "RandomForestClassifier":
        if 'max_features' in hyperparameters and hyperparameters['max_features'] == 'auto':
            print("Correcting 'max_features' from 'auto' to 'sqrt' for RandomForestClassifier.")
            hyperparameters['max_features'] = 'sqrt'
            corrected = True
    return hyperparameters, corrected

 

Then, we will need the function that runs the model training and evaluation. The code below trains a model given the split dataset, a model name from our mapping, and optional hyperparameters. The result is the evaluation metrics and the fitted model object.

def train_and_evaluate(X_train, X_test, y_train, y_test, model_name, hyperparameters=None):
    if model_name not in model_mapping:
        print(f"Valid model names are: {list(model_mapping.keys())}")
        return None, None

    model_class = model_mapping.get(model_name)
    try:
        if hyperparameters:
            hyperparameters, corrected = correct_hyperparameters(hyperparameters, model_name)
            if not validate_hyperparameters(model_class, hyperparameters):
                return None, None
            model = model_class(**hyperparameters)
        else:
            model = model_class()
    except Exception as e:
        print(f"Error instantiating model with hyperparameters: {e}")
        return None, None
    try:
        model.fit(X_train, y_train)
    except Exception as e:
        print(f"Error during model fitting: {e}")
        return None, None


    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average='weighted', zero_division=0),
        "recall": recall_score(y_test, y_pred, average='weighted', zero_division=0),
        "f1_score": f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }
    return metrics, model

 

With all the preparation ready, we can set up the automation process. It consists of a few steps:

  1. Train and evaluate all models
  2. Let the LLM select the best model
  3. Ask the LLM for hyperparameter suggestions for the best model
  4. Automatically run hyperparameter tuning if the LLM suggests it

def run_llm_based_model_selection_experiment(df, config):
    #Model Training
    X = df.drop("is_fraud", axis=1)
    y = df["is_fraud"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    available_models = config['default_models']
    model_performance = {}

    for model_name in available_models:
        print(f"Training model: {model_name}")
        metrics, _ = train_and_evaluate(X_train, X_test, y_train, y_test, model_name)
        model_performance[model_name] = metrics
        print(f"Model: {model_name} | Metrics: {metrics}")

    #LLM selecting the best model
    sklearn_version = sklearn.__version__
    prompt = (
        f"I have trained the following models with these metrics: {model_performance}. "
        "Which model should I select based on the best performance?"
    )
    best_model_response = call_llm(prompt, config['llm_api_key'])
    print(f"LLM response for best model selection:\n{best_model_response}")

    best_model = extract_model_name(best_model_response, available_models)
    if not best_model:
        print("Error: Could not extract a valid model name from LLM response.")
        return
    print(f"LLM selected the best model: {best_model}")

    #Check for hyperparameter tuning
    prompt_tuning = (
        f"The selected model is {best_model}. Can you suggest hyperparameters for better performance? "
        "Please provide them in Python dictionary format, like {'max_depth': 5, 'min_samples_split': 4}. "
        f"Ensure that all suggested hyperparameters are valid for scikit-learn version {sklearn_version}, "
        "and avoid using deprecated or invalid values such as 'max_features': 'auto'. "
        "Don't provide any explanation or return in any other format."
    )
    tuning_suggestion = call_llm(prompt_tuning, config['llm_api_key'])
    print(f"Hyperparameter tuning suggestion received:\n{tuning_suggestion}")

    cleaned_suggestion = clean_hyperparameter_suggestion(tuning_suggestion)
    if cleaned_suggestion is None:
        suggested_params = None
    else:
        try:
            suggested_params = ast.literal_eval(cleaned_suggestion)
            if not isinstance(suggested_params, dict):
                print("Hyperparameter suggestion is not a valid dictionary.")
                suggested_params = None
        except (ValueError, SyntaxError) as e:
            print(f"Error parsing hyperparameter suggestion: {e}")
            suggested_params = None

    #Automatically run hyperparameter tuning if suggested
    if suggested_params:
        print(f"Running {best_model} with suggested hyperparameters: {suggested_params}")
        tuned_metrics, _ = train_and_evaluate(
            X_train, X_test, y_train, y_test, best_model, hyperparameters=suggested_params
        )
        print(f"Metrics after tuning: {tuned_metrics}")
    else:
        print("No valid hyperparameters were provided for tuning.")

 

In the code above, I have specified how the LLM evaluates each of our models based on the experiment results. We use the following prompt to select which model to use based on performance.

prompt = (
        f"I have trained the following models with these metrics: {model_performance}. "
        "Which model should I select based on the best performance?")

 

You can always change the prompt to implement a different rule for the model selection.
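
For instance, because this is a fraud detection problem, you might want the selection rule to weight recall more heavily. The wording below is just one illustrative variation I am assuming, not the prompt used in the experiment above.

prompt = (
    f"I have trained the following models with these metrics: {model_performance}. "
    "This is a fraud detection task, so prioritize recall, but only consider models "
    "whose precision is at least 0.9. Which model should I select, and why?"
)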

Once the best model has been selected, I use the following prompt to ask which hyperparameters should be used for the follow-up process. I also specify the scikit-learn version, as the valid hyperparameters can vary depending on the version.

prompt_tuning = (
        f"The selected model is {best_model}. Can you suggest hyperparameters for better performance? "
        "Please provide them in Python dictionary format, like {'max_depth': 5, 'min_samples_split': 4}. "
        f"Ensure that all suggested hyperparameters are valid for scikit-learn version {sklearn_version}, "
        "and avoid using deprecated or invalid values such as 'max_features': 'auto'. "
        "Don't provide any explanation or return in any other format.")

 

You can change the prompt in any way you want, such as by asking for a more exploratory hyperparameter search or including another tuning technique, as sketched below.
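
As one sketch of that idea, you could ask the LLM for a small parameter grid instead of a single dictionary and pass it to scikit-learn's GridSearchCV inside run_llm_based_model_selection_experiment. The prompt wording and grid handling below are my own illustrative assumptions, and the snippet reuses variables already defined in that function.

from sklearn.model_selection import GridSearchCV

prompt_grid = (
    f"The selected model is {best_model}. Suggest a small hyperparameter grid for it "
    "as a Python dictionary of lists, like {'max_depth': [5, 10], 'min_samples_split': [2, 4]}. "
    f"Ensure all hyperparameters are valid for scikit-learn version {sklearn_version}. "
    "Don't provide any explanation or return in any other format."
)
grid_suggestion = clean_hyperparameter_suggestion(call_llm(prompt_grid, config['llm_api_key']))

if grid_suggestion:
    param_grid = ast.literal_eval(grid_suggestion)
    # Search over the LLM-suggested grid using weighted F1, mirroring the metrics used above
    search = GridSearchCV(model_mapping[best_model](), param_grid, scoring='f1_weighted', cv=3)
    search.fit(X_train, y_train)
    print(f"Best parameters from grid search: {search.best_params_}")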

I put all the code above in one file called automated_model_llm.py. Lastly, add the following code to run the whole process.

def main():
    config = load_config()
    df = load_data(config['dataset_path'])
    df, _ = preprocess_data(df)
    run_llm_based_model_selection_experiment(df, config)


if __name__ == "__main__":
    main()

 

Once everything is ready, you can run the following command to execute the script.

python automated_model_llm.py

 

Output:

LLM response for best model selection:

Looking at the metrics shared, the RandomForestClassifier is the model performing the best. It has the highest accuracy (0.9723119520073835), precision (0.9715734023282823), recall (0.9723119520073835), and f1_score (0.9717111855357631) compared to the LogisticRegression and DecisionTreeClassifier models.

LLM selected the best model: RandomForestClassifier
Hyperparameter tuning suggestion received:
{
'n_estimators': 100,
'max_depth': None,
'min_samples_split': 2,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'bootstrap': True
}
Running RandomForestClassifier with suggested hyperparameters: {'n_estimators': 100, 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'bootstrap': True}
Metrics after tuning: {'accuracy': 0.9730041532071989, 'precision': 0.9722907483489197, 'recall': 0.9730041532071989, 'f1_score': 0.9724045530119824}

 

That was the example output from my experiment; yours might be different. You can adjust the prompt and the generation parameters to get a more varied or more rigid LLM output. Nevertheless, you can apply LLMs to model selection and experimentation automation if you structure the code correctly.
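
For example, to make the suggestions more deterministic, you could pass a temperature when creating the completion inside call_llm. The value below is an assumption for illustration, not what the original code uses.

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in machine learning and able to evaluate the model well."},
        {"role": "user", "content": prompt}
    ],
    temperature=0  # lower values make the output more deterministic; assumed value for illustration
)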
 

Conclusion

 
LLMs have been used in many use cases, including code generation. By applying an LLM such as OpenAI's GPT models, we can easily delegate the task of model selection and experimentation, as long as we structure the output correctly. In the example, we used a sample dataset to train several models and asked the LLM to select the best one and suggest improvements to it.

I hope this has helped!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.


