Easily Integrate LLMs into Your Scikit-learn Workflow with Scikit-LLM

LLMs are powerful models that can improve our text analysis. With Scikit-LLM, we can easily integrate an LLM into our ML pipeline.

By Cornellius Yudha Wijaya, KDnuggets Technical Content Specialist on December 21, 2023 in Language Models

Image Generated by DALL-E 2

Text analysis tasks have been around for some time because the need for them never goes away. Research has come a long way, from simple descriptive statistics to text classification and advanced text generation. With Large Language Models (LLMs) added to our arsenal, these tasks become even more accessible.

Scikit-LLM is a Python package developed for text analysis with the power of LLMs. It stands out because it lets us integrate LLMs into the standard Scikit-Learn pipeline.

So, what is this package about, and how does it work? Let’s get into it.

Scikit-LLM

Scikit-LLM is a Python package that enhances text analytics tasks with LLMs. It was developed by BeastByte to bridge the standard Scikit-Learn library and the power of language models. Scikit-LLM designed its API to resemble Scikit-Learn's, so it should feel familiar to use.

Installation

To use the package, we need to install it first. You can do that with the following command.

pip install scikit-llm

At the time of writing, Scikit-LLM is compatible only with some OpenAI and GPT4ALL models, so this article works with the OpenAI models only. To use a GPT4ALL model instead, install the optional component first.

pip install scikit-llm[gpt4all]

After installation, you must set up the OpenAI key to access the LLM models.

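from skllm.config import SKLLMConfig

# Replace the placeholders with your own OpenAI API key and organization ID
SKLLMConfig.set_openai_key("<YOUR_KEY>")
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")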
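
Hard-coding a key in a script makes it easy to leak. As a minimal sketch, the key can be read from environment variables instead. The helper name read_openai_credentials and the variable names OPENAI_API_KEY and OPENAI_ORG_ID are assumptions made for this example, not part of Scikit-LLM.

```python
import os


def read_openai_credentials(env=os.environ):
    """Return (key, org) from a mapping of environment variables.

    The variable names used here are assumptions for this sketch.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    org = env.get("OPENAI_ORG_ID", "").strip()
    if not key:
        # Fail early with a clear message instead of a confusing API error later
        raise RuntimeError("Set OPENAI_API_KEY before configuring Scikit-LLM")
    return key, org


# Usage with the SKLLMConfig calls shown above:
# key, org = read_openai_credentials()
# SKLLMConfig.set_openai_key(key)
# SKLLMConfig.set_openai_org(org)
```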

Trying out Scikit-LLM

With the environment set, let’s try out some Scikit-LLM capabilities. One ability of LLMs is performing text classification without retraining, which we call Zero-Shot classification. We will first try Zero-Shot text classification with the sample data.

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Labels: positive, neutral, negative
X, y = get_classification_dataset()

# Initialize the classifier with GPT-3.5
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)

You only need to provide the text data in the X variable and the labels in y. In this case, each label is a sentiment: positive, neutral, or negative.
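
Since the sample dataset comes with true labels, the predictions can be checked against them. The following is a minimal, dependency-free sketch of an accuracy check; the two short lists are made-up values standing in for y and the output of clf.predict(X).

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty dataset")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values standing in for y and clf.predict(X)
y = ["positive", "neutral", "negative", "positive"]
labels = ["positive", "negative", "negative", "positive"]

print(accuracy(y, labels))  # 0.75
```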

As you can see, the process mirrors the fit and predict methods in Scikit-Learn. However, Zero-Shot classification does not require a training dataset; the classifier only needs to know the candidate labels. That’s why we can provide the labels without any training data.

X, _ = get_classification_dataset()

clf = ZeroShotGPTClassifier()
clf.fit(None, ["positive", "negative", "neutral"])
labels = clf.predict(X)
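
To make the idea concrete, here is a rough sketch of what Zero-Shot classification amounts to: the candidate labels are placed in a prompt, and the model's reply is mapped back to one of them. The functions build_zero_shot_prompt and parse_label and the prompt wording are hypothetical illustrations, not the prompt or code that Scikit-LLM uses internally.

```python
def build_zero_shot_prompt(text, candidate_labels):
    """Build an instruction that asks for exactly one of the candidate labels."""
    labels = ", ".join(candidate_labels)
    return (
        f"Classify the text into exactly one of these labels: {labels}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )


def parse_label(response, candidate_labels, default=None):
    """Map a raw model reply to a known label, ignoring case and stray punctuation."""
    cleaned = response.strip().strip(".\"'").lower()
    for label in candidate_labels:
        if label.lower() == cleaned:
            return label
    return default


candidate_labels = ["positive", "negative", "neutral"]
prompt = build_zero_shot_prompt("The plot was dull.", candidate_labels)
print(prompt)

# A made-up model reply, used only to show the parsing step
print(parse_label(" Negative. ", candidate_labels))  # negative
```

No training examples appear anywhere in the prompt, which is why fit can be called with None in place of X.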

This can also be extended to multilabel classification, as shown in the following code.

from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

X, _ = get_multilabel_classification_dataset()

candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety",
    "Customer Support",
    "Packaging",
]

# Each sample may receive up to four of the candidate labels
clf = MultiLabelZeroShotGPTClassifier(max_labels=4)
clf.fit(None, [candidate_labels])
labels = clf.predict(X)
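
The max_labels argument caps how many labels a single sample can receive. As a hedged illustration of that idea, the sketch below filters a comma-separated reply down to known labels and stops at the cap. The function select_labels and the reply string are hypothetical; this is not how Scikit-LLM parses responses internally.

```python
def select_labels(response, candidate_labels, max_labels):
    """Keep known labels from a comma-separated reply, in order, without duplicates."""
    known = {label.lower(): label for label in candidate_labels}
    picked = []
    for part in response.split(","):
        if len(picked) >= max_labels:
            break  # the cap is reached, ignore the rest of the reply
        label = known.get(part.strip().lower())
        if label is not None and label not in picked:
            picked.append(label)
    return picked


candidates = ["Quality", "Price", "Delivery", "Service"]

# A made-up reply containing an unknown label ("Speed") and a duplicate
reply = "price, Delivery, Speed, price, Quality"

print(select_labels(reply, candidates, max_labels=2))  # ['Price', 'Delivery']
print(select_labels(reply, candidates, max_labels=4))  # ['Price', 'Delivery', 'Quality']
```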

What’s amazing about Scikit-LLM is that it lets us extend the power of LLMs to the typical Scikit-Learn pipeline.

Scikit-LLM in the ML Pipeline

In the next example, I will show how to use Scikit-LLM as a vectorizer with XGBoost as the classifier. We will also wrap both steps in a model pipeline.

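First, we load the data, split it into training and test sets, and use a label encoder to transform the labels into numerical values.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X, y = get_classification_dataset()

# Hold out part of the data so the pipeline can be tested on unseen text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)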

Next, we define a pipeline that performs vectorization and model fitting. We can do that with the following code.

from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from skllm.preprocessing import GPTVectorizer

steps = [("GPT", GPTVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)

# Fit the pipeline on the training data
clf.fit(X_train, y_train_enc)

Lastly, we can perform prediction with the following code.

pred_enc = clf.predict(X_test)
preds = le.inverse_transform(pred_enc)

As we can see, Scikit-LLM and XGBoost work together under the Scikit-Learn pipeline. Combining the two can give stronger predictions, although the gain depends on the data.
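
The pipeline works because the vectorizer follows the Scikit-Learn transformer contract: it exposes fit and transform, so any classifier can consume its output. The sketch below shows that contract with substitutes that run offline: HashingTextVectorizer is a hypothetical stand-in for GPTVectorizer (it hashes words instead of calling an LLM), and LogisticRegression replaces XGBoost. The four texts are made up for the example.

```python
import hashlib

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


class HashingTextVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical offline stand-in for an embedding step.

    Each word is hashed into one of n_features buckets and counted.
    """

    def __init__(self, n_features=32):
        self.n_features = n_features

    def fit(self, X, y=None):
        # Nothing to learn; returning self satisfies the transformer contract
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.n_features))
        for row, text in enumerate(X):
            for word in text.lower().split():
                digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
                out[row, int(digest, 16) % self.n_features] += 1.0
        return out


# Made-up training data for the sketch
texts = [
    "great movie loved it",
    "terrible plot hated it",
    "loved the acting great cast",
    "hated the ending terrible pacing",
]
sentiments = ["positive", "negative", "positive", "negative"]

le = LabelEncoder()
y_enc = le.fit_transform(sentiments)

pipe = Pipeline([("vec", HashingTextVectorizer()), ("clf", LogisticRegression())])
pipe.fit(texts, y_enc)

# Decode the numeric prediction back to its text label
preds = le.inverse_transform(pipe.predict(["great cast loved it"]))
print(preds)
```

Swapping the stand-in for GPTVectorizer changes only the first step of the pipeline; the encoding, fitting, and decoding steps stay the same.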

Scikit-LLM supports other tasks as well, including model fine-tuning; I suggest checking the documentation to learn more. You can also use an open-source model from GPT4ALL if necessary.

Conclusion

Scikit-LLM is a Python package that brings the power of LLMs to Scikit-Learn text analysis tasks. In this article, we discussed how to use Scikit-LLM for text classification and how to combine it with other models in a machine learning pipeline.

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
