LGBMClassifier: A Getting Started Guide

This tutorial explores the LightGBM library in Python to build a classification model using the LGBMClassifier class.



Many machine learning algorithms are suited to modeling specific phenomena. Some rely on a single model built from a set of attributes, while others, known as ensemble models, combine several weak learners so that each one contributes additional information to the final prediction.

The premise of ensemble models is to improve performance by combining the predictions of different models so that their individual errors are reduced. There are two popular ensembling techniques: bagging and boosting.

Bagging, aka Bootstrapped Aggregation, trains multiple individual models on different random subsets of the training data and then averages their predictions to produce the final prediction. Boosting, on the other hand, involves training individual models sequentially, where each model attempts to correct the errors made by the previous models.
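
To make the contrast concrete, here is a minimal sketch in plain Python. It is not how LightGBM is implemented: it assumes a toy one-dimensional regression problem, one-split "stumps" as the weak learners, and squared error as the loss. The helper names (fit_stump, bagging, boosting) are hypothetical and exist only for this illustration.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x values identical: fall back to a constant prediction
        mean = sum(ys) / len(ys)
        return xs[0], mean, mean
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def bagging(xs, ys, n_models, rng):
    """Train each stump independently on a bootstrap sample, then average."""
    n = len(xs)
    stumps = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(s, x) for s in stumps) / len(stumps)

def boosting(xs, ys, n_models, learning_rate):
    """Train stumps sequentially, each one on the residuals of the ensemble so far."""
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(n_models):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return lambda x: sum(learning_rate * predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data: y = x^2 on 20 evenly spaced points (an assumption for the demo)
xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]

single = fit_stump(xs, ys)
single_mse = mse(lambda x: predict_stump(single, x), xs, ys)
bagged_mse = mse(bagging(xs, ys, 25, random.Random(0)), xs, ys)
boosted_mse = mse(boosting(xs, ys, 25, 0.5), xs, ys)

print("single stump :", single_mse)
print("bagged stumps:", bagged_mse)
print("boosted stumps:", boosted_mse)
```

The structural difference is the point here: the bagging loop never looks at earlier models, while the boosting loop recomputes the residuals after every round, so each new stump targets what the earlier ones missed.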

Now that we have context about ensemble models, let us take a closer look at boosting, specifically the LightGBM (LGBM) algorithm developed by Microsoft.


What is LGBMClassifier?


LGBMClassifier stands for Light Gradient Boosting Machine Classifier. LightGBM uses decision tree algorithms for ranking, classification, and other machine learning tasks. It relies on two novel techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to handle large-scale data without giving up much accuracy, which makes training faster and reduces memory usage.


What is Gradient-based One-Side Sampling (GOSS)?


Traditional gradient boosting algorithms use all the data for training, which can be time-consuming when dealing with large datasets. LightGBM's GOSS, on the other hand, keeps all the instances with large gradients and performs random sampling on the instances with small gradients. The intuition behind this is that instances with large gradients are harder to fit and thus carry more information. GOSS introduces a constant multiplier for the data instances with small gradients to compensate for the information loss during sampling.
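
The sampling step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not LightGBM's internal code: goss_sample is a hypothetical helper, the two rates are the fraction of large-gradient instances kept and the fraction of all instances sampled from the remainder, and the multiplier (1 - top_rate) / other_rate is the constant used to re-weight the sampled small-gradient instances.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by a GOSS-style sampler (illustrative only)."""
    n = len(gradients)
    # Rank instances by absolute gradient, largest first
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]    # always kept: large gradients
    rest = order[n_top:]   # candidates for random sampling: small gradients
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Up-weight the sampled small-gradient instances to compensate for the
    # ones that were dropped
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights

# Demo with synthetic gradients (an assumption; real gradients come from the loss)
demo_rng = random.Random(1)
gradients = [demo_rng.gauss(0, 1) for _ in range(100)]
indices, weights = goss_sample(gradients)
print(len(indices), "of", len(gradients), "instances kept")
```

With the rates shown, only 30% of the instances are used to evaluate splits in a given round, yet every large-gradient instance is still present.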


What is Exclusive Feature Bundling (EFB)?


In a sparse dataset, most of the features are zeros. EFB is a near-lossless algorithm that bundles/combines mutually exclusive features (features that are not non-zero simultaneously) to reduce the number of dimensions, thereby accelerating the training process. Since these features are "exclusive", the original feature space is retained without significant information loss.
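
As a rough illustration of the bundling idea, the sketch below merges two mutually exclusive columns into one by shifting the second column's values past the range of the first. It assumes the features are already non-negative integer bin indices with 0 meaning "absent"; the function names are hypothetical, and the real algorithm also decides which features to bundle and tolerates a small number of conflicts.

```python
def are_exclusive(col_a, col_b):
    """True if the two columns are never non-zero in the same row."""
    return all(a == 0 or b == 0 for a, b in zip(col_a, col_b))

def bundle(col_a, col_b):
    """Merge two exclusive columns into one by offsetting col_b's values."""
    offset = max(col_a)  # col_b's non-zero values are shifted past col_a's range
    bundled = []
    for a, b in zip(col_a, col_b):
        if a != 0:
            bundled.append(a)
        elif b != 0:
            bundled.append(b + offset)
        else:
            bundled.append(0)
    return bundled, offset

def unbundle(bundled, offset):
    """Recover the original two columns from the bundled one."""
    col_a = [v if 0 < v <= offset else 0 for v in bundled]
    col_b = [v - offset if v > offset else 0 for v in bundled]
    return col_a, col_b

feature_a = [1, 0, 3, 0, 0]
feature_b = [0, 2, 0, 0, 1]

assert are_exclusive(feature_a, feature_b)
bundled, offset = bundle(feature_a, feature_b)
print(bundled)                    # [1, 5, 3, 0, 4]
print(unbundle(bundled, offset))  # ([1, 0, 3, 0, 0], [0, 2, 0, 0, 1])
```

Because the two value ranges do not overlap inside the bundled column, a tree split on it can still separate values that came from either original feature, which is why little information is lost.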




The LightGBM package can be installed directly using pip, Python's package manager. Run the command below in a terminal or command prompt to download and install the LightGBM library on your machine:

pip install lightgbm


Anaconda users can install it using the “conda install” command as listed below.

conda install -c conda-forge lightgbm


Depending on your OS, you can choose an installation method from the official LightGBM installation guide.




Now, let's import LightGBM and other necessary libraries:

import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


Preparing the Dataset


We are using the popular Titanic dataset, which contains information about the passengers on the Titanic, with the target variable signifying whether they survived or not. You can download the dataset from Kaggle or use the following code to load it directly from Seaborn, as shown below:

titanic = sns.load_dataset('titanic')


Drop unnecessary columns such as “deck”, “embark_town”, and “alive” because they are redundant or do not help predict whether a passenger survived. Next, we handle the features with missing values, “age”, “fare”, and “embarked”, by imputing each with an appropriate statistic: the median for “age” and the mode for “fare” and “embarked”.

# Drop unnecessary columns
titanic = titanic.drop(['deck', 'embark_town', 'alive'], axis=1)

# Replace missing values with the median or mode
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic['fare'] = titanic['fare'].fillna(titanic['fare'].mode()[0])
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])


Lastly, we convert the categorical variables “sex” and “embarked” to numerical variables using pandas' categorical codes, and then separate the input features from the target variable.

# Convert categorical variables to numerical variables
titanic['sex'] = pd.Categorical(titanic['sex']).codes
titanic['embarked'] = pd.Categorical(titanic['embarked']).codes

# Split the dataset into input features and the target variable
X = titanic.drop('survived', axis=1)
y = titanic['survived']
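
It can help to see what the categorical codes used above actually contain. The short sketch below runs on a small hand-made Series (an assumption, not the Titanic data) and shows that pandas assigns codes in sorted category order and uses -1 for missing values, which is one reason to impute before encoding.

```python
import pandas as pd

# Small hand-made column with one missing value (illustrative data only)
sex = pd.Series(["male", "female", "female", None, "male"])

cat = pd.Categorical(sex)
codes = cat.codes.tolist()
print(list(cat.categories))  # ['female', 'male']
print(codes)                 # [1, 0, 0, -1, 1]

# Keep the mapping so that encoded values can be decoded later
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)         # {0: 'female', 1: 'male'}
```

Storing a mapping such as code_to_label is optional, but it makes feature values easier to interpret once the original strings have been replaced.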


Training the LGBMClassifier Model


To begin training the LGBMClassifier model, we split the input features and the target variable into training and testing sets using the train_test_split function from scikit-learn.

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Let’s label encode the categorical (“who”) and ordinal (“class”) columns so that the model is supplied with numerical data, as LightGBM cannot consume raw string columns.

# Ordinal encoding for the passenger class
class_dict = {
    "Third": 3,
    "First": 1,
    "Second": 2
}

# Label encoding for the "who" column
who_dict = {
    "child": 0,
    "woman": 1,
    "man": 2
}

# Apply the same mappings to the training and testing sets
X_train['class'] = X_train['class'].apply(lambda x: class_dict[x])
X_train['who'] = X_train['who'].apply(lambda x: who_dict[x])
X_test['class'] = X_test['class'].apply(lambda x: class_dict[x])
X_test['who'] = X_test['who'].apply(lambda x: who_dict[x])


Next, we specify the model hyperparameters. They can be passed as keyword arguments to the constructor, or a dictionary of them can be unpacked into the set_params method after the model is created.

The last step is to create an instance of the LGBMClassifier class and fit it to the training data.

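# Hyperparameters passed to the LGBMClassifier constructor
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9  # alias of colsample_bytree in the scikit-learn API
}

# Create the classifier and fit it to the training data
clf = lgb.LGBMClassifier(**params)
clf.fit(X_train, y_train)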


Next, let us evaluate the trained classifier’s performance on the unseen test dataset.

predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

           0       0.84      0.89      0.86       105
           1       0.82      0.76      0.79        74

    accuracy                           0.83       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.83      0.83      0.83       179


Hyperparameter Tuning


The LGBMClassifier offers considerable flexibility through hyperparameters that you can tune for optimal performance. Here, we will briefly discuss some of the key hyperparameters:

  • num_leaves: This is the main parameter to control the complexity of the tree model. Ideally, the value of num_leaves should be less than or equal to 2^(max_depth).
  • min_data_in_leaf: This is an important parameter to prevent overfitting in a leaf-wise tree. Its optimal value depends on the number of training samples and num_leaves.
  • max_depth: You can use this to limit the tree depth explicitly. It's best to tune this parameter in case of overfitting.
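
The relationship between num_leaves and max_depth in the list above is easy to check before training. The helper below is a hypothetical convenience function, not part of LightGBM; it assumes the LightGBM convention that a max_depth of zero or less means "no depth limit".

```python
def leaves_fit_depth(num_leaves, max_depth):
    """Check whether num_leaves respects the 2^(max_depth) guideline.

    A max_depth <= 0 is treated as "no limit", so any num_leaves passes.
    """
    if max_depth <= 0:
        return True
    return num_leaves <= 2 ** max_depth

print(leaves_fit_depth(31, 5))   # True: a depth-5 tree can hold up to 32 leaves
print(leaves_fit_depth(64, 5))   # False: 64 leaves cannot fit in depth 5
print(leaves_fit_depth(31, -1))  # True: depth is unlimited
```

A tree limited to depth d can have at most 2^d leaves, so a num_leaves value above that bound can never be reached and only makes the configuration harder to reason about.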

Let's set these hyperparameters explicitly and train a new model:

model = lgb.LGBMClassifier(num_leaves=31, min_data_in_leaf=20, max_depth=5)
model.fit(X_train, y_train)


predictions = model.predict(X_test)
print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

           0       0.85      0.89      0.87       105
           1       0.83      0.77      0.80        74

    accuracy                           0.84       179
   macro avg       0.84      0.83      0.83       179
weighted avg       0.84      0.84      0.84       179


Note that hyperparameter tuning is an iterative process of trial and error. It is also guided by experience, an understanding of the boosting algorithm, and domain knowledge of the business problem you're working on.

In this post, you learned about the LightGBM algorithm and its Python implementation. It is a flexible technique that is useful for various types of classification problems and should be a part of your machine-learning toolkit.
Vidhi Chugh is an AI strategist and a digital transformation leader working at the intersection of product, sciences, and engineering to build scalable machine learning systems. She is an award-winning innovation leader, an author, and an international speaker. She is on a mission to democratize machine learning and break the jargon for everyone to be a part of this transformation.