Implementing Adaboost in Scikit-learn

 

What is AdaBoost?

 

The AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used as an Ensemble Method in Machine Learning. It is called Adaptive Boosting because the weights are re-assigned to each instance after every round, with higher weights assigned to instances that were incorrectly classified, so the algorithm ‘adapts’ to its earlier mistakes.

If you want to learn more about Ensemble Methods and when to use them, read this article: When Would Ensemble Techniques be a Good Choice?

Boosting is an ensemble machine learning method used to reduce errors in predictive data analysis. It does this by combining the predictions of weak learners. Boosting began as a theoretical concept before it became a practical technique.

If you would like to know more about Boosting, read this article: Boosting Machine Learning Algorithms: An Overview

 

Terminology

 

What is a weak learner? A weak learner is a simple model with a limited level of skill that does only slightly better than random chance.

What is a Stump? In tree-based algorithms, a tree with a single decision node and two leaves is known as a Stump. In AdaBoost, weak learners are almost always stumps.
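To make these two terms concrete, here is a small sketch that builds a stump with Scikit-Learn and compares it to a baseline that always predicts the most frequent class. The dataset and the baseline are choices made for illustration; on real data a single stump may do considerably better than "slightly better than chance", and the point here is only that it is a one-split model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A stump: a decision tree limited to a single split
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(stump.tree_.node_count)  # 3 (one decision node plus two leaves)
print(stump.get_n_leaves())    # 2

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

# Accuracy on the training data, just to compare the two models
stump_accuracy = stump.score(X, y)
baseline_accuracy = baseline.score(X, y)
print(f"stump: {stump_accuracy:.3f}, baseline: {baseline_accuracy:.3f}")
```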

 

How does AdaBoost work?

 

The AdaBoost algorithm trains a sequence of short decision trees, known as weak learners. After the first tree is trained, the instances that it classified incorrectly are given more weight, and the re-weighted data is used to train the second tree. This process is repeated, with each new tree attempting to correct the predictions made by the trees before it, until a set number of trees has been trained.

The ideas behind AdaBoost:

  1. The combination of weak learners to make classifications
  2. Some stumps have more say in the classification than others
  3. Each stump takes the previous stumps' mistakes into consideration
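
The three ideas above can be sketched in a few lines of Python. This is a minimal, from-scratch version of the classic discrete AdaBoost update for binary labels coded as -1 and +1, not Scikit-Learn's own implementation. The function names fit_adaboost and predict_adaboost and the choice of 20 rounds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost(X, y, n_rounds=20):
    """Fit stumps sequentially; y must contain only -1 and +1."""
    n_samples = len(y)
    weights = np.full(n_samples, 1.0 / n_samples)  # start with equal weights
    stumps, says = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=weights)
        missed = stump.predict(X) != y
        # Weighted error of this stump, clipped to avoid division by zero
        error = np.clip(weights[missed].sum(), 1e-10, 1 - 1e-10)
        # Idea 2: the lower the error, the more say the stump gets
        say = 0.5 * np.log((1 - error) / error)
        # Idea 3: raise the weights of missed instances, lower the rest
        weights = weights * np.exp(np.where(missed, say, -say))
        weights = weights / weights.sum()
        stumps.append(stump)
        says.append(say)
    return stumps, np.array(says)


def predict_adaboost(stumps, says, X):
    # Idea 1: combine the weak learners through a weighted vote
    votes = sum(say * stump.predict(X) for stump, say in zip(stumps, says))
    return np.where(votes >= 0, 1, -1)


X, target = load_breast_cancer(return_X_y=True)
y = np.where(target == 1, 1, -1)  # recode the 0/1 labels as -1/+1

stumps, says = fit_adaboost(X, y, n_rounds=20)
train_accuracy = np.mean(predict_adaboost(stumps, says, X) == y)
print(f"training accuracy with {len(stumps)} stumps: {train_accuracy:.3f}")
```

In practice you would use Scikit-Learn's estimators, described next, rather than a hand-written loop like this one.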

 

AdaBoost and Scikit-Learn

 

Scikit-Learn is a Python machine learning library whose ensemble module implements AdaBoost. AdaBoost can be used for both classification and regression problems, so let’s look at how we can use Scikit-Learn for each.

 

Classification

 

sklearn.ensemble.AdaBoostClassifier


The AdaBoost classifier starts by fitting a classifier on the original dataset for the task at hand. It then fits additional classifiers on the same dataset, with the weights of incorrectly classified instances adjusted so that later classifiers focus on them.

These are the parameters:

sklearn.ensemble.AdaBoostClassifier(base_estimator=None, *,
    n_estimators=50, learning_rate=1.0, algorithm='SAMME.R',
    random_state=None)


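You can learn more about them and their attributes in the Scikit-Learn documentation for AdaBoostClassifier. Note that the signature above reflects the Scikit-Learn release available when this article was written; newer releases rename base_estimator to estimator.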
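
As a rough illustration of how two of these parameters interact, the sketch below cross-validates a few combinations of n_estimators and learning_rate on the breast cancer dataset. The grid values are arbitrary choices for demonstration, not recommendations. The base learner is left at its default (a depth-1 decision tree), so the snippet does not depend on how your Scikit-Learn version names that parameter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Illustrative settings only: (n_estimators, learning_rate)
settings = [(25, 1.0), (50, 1.0), (100, 0.5)]

results = {}
for n_estimators, learning_rate in settings:
    model = AdaBoostClassifier(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=1,
    )
    # Mean 5-fold cross-validated accuracy for this setting
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[(n_estimators, learning_rate)] = scores.mean()
    print(f"n_estimators={n_estimators:>3}, learning_rate={learning_rate}: "
          f"accuracy={scores.mean():.3f}")
```

A smaller learning_rate shrinks the contribution of each stump, so it is usually paired with a larger n_estimators.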

Let’s look at an example:

Imports: 

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder


Load dataset:

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)


Encode malignant to 1 and benign to 0:

encoder = LabelEncoder()
binary_encoded_y = pd.Series(encoder.fit_transform(y))


Training/Test set:

train_X, test_X, train_y, test_y = train_test_split(X,
    binary_encoded_y, random_state = 1)


Fit our model:

classifier = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth = 1),
    n_estimators = 200
)
classifier.fit(train_X, train_y)


Make predictions:

predictions = classifier.predict(test_X)


Evaluate the model:

confusion_matrix(test_y, predictions)


Output:

array([[86,  2],
       [ 3, 52]])


Code source: Cory Maklin
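
To see how the ensemble changes as stumps are added, one option is staged_score, which yields the test accuracy after each boosting round. The sketch below is a self-contained variant of the example above, not part of the original code. It uses the integer labels from load_breast_cancer directly (0 is malignant and 1 is benign, the reverse of the encoding above), uses fewer estimators to keep it quick, and fixes random_state, so its numbers will not match the output shown earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

# Stumps as weak learners, passed positionally as in the example above
classifier = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=1,
)
classifier.fit(train_X, train_y)

# Test accuracy after 1, 2, ..., n boosting rounds
staged_accuracy = list(classifier.staged_score(test_X, test_y))
final_accuracy = accuracy_score(test_y, classifier.predict(test_X))

print(f"after 1 stump:    {staged_accuracy[0]:.3f}")
print(f"after all stumps: {final_accuracy:.3f}")
```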

 

Regression

 

sklearn.ensemble.AdaBoostRegressor


The aim of the AdaBoost regressor is to start off with fitting a regressor on the original dataset for the task at hand and then fit additional regressors where the weights have been adjusted based on the current prediction error.

These are the parameters:

sklearn.ensemble.AdaBoostRegressor(base_estimator=None, *,
    n_estimators=50, learning_rate=1.0, loss='linear',
    random_state=None)


If you would like to know more about them and their attributes, see the Scikit-Learn documentation for AdaBoostRegressor.

Let’s look at an example:

# evaluate adaboost ensemble for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import AdaBoostRegressor

# define dataset
X, y = make_regression(n_samples = 1000, n_features = 20,
    n_informative = 15, noise = 0.1, random_state = 6)

# define the model
model = AdaBoostRegressor()

# evaluate the model
cv = RepeatedKFold(n_splits = 10, n_repeats = 3, random_state = 1)
n_scores = cross_val_score(model, X, y,
    scoring = 'neg_mean_absolute_error', cv = cv, n_jobs = -1,
    error_score = 'raise')

# report performance (scores are negated MAE, so the printed mean is negative)
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


Code source: MachineLearningMastery
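
The loss parameter controls how each instance's prediction error is turned into a weight update. As a hedged sketch, the snippet below fits one regressor for each documented loss option ('linear', 'square' and 'exponential') on a held-out split of the same synthetic data and reports the test MAE. The train/test split and the random_state values are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the example above
X, y = make_regression(n_samples=1000, n_features=20,
                       n_informative=15, noise=0.1, random_state=6)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

mae_by_loss = {}
for loss in ("linear", "square", "exponential"):
    model = AdaBoostRegressor(loss=loss, random_state=1)
    model.fit(train_X, train_y)
    # Mean absolute error on the held-out test set (lower is better)
    mae_by_loss[loss] = mean_absolute_error(test_y, model.predict(test_X))
    print(f"loss={loss:<11} MAE: {mae_by_loss[loss]:.3f}")
```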

 

Wrapping it up

 

If you want to learn more about ensemble methods and how to make better predictions using bagging, boosting, and stacking, check out the MachineLearningMastery book: Ensemble Learning Algorithms With Python

Josh Starmer, the Statistics and Machine Learning Guru, helped me to better understand AdaBoost through this video: AdaBoost, Clearly Explained

 
 
Nisha Arya is a Data Scientist and Freelance Technical Writer. She is particularly interested in providing Data Science career advice, tutorials, and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is benefiting, or can benefit, the longevity of human life. She is a keen learner, seeking to broaden her tech knowledge and writing skills whilst helping guide others.