
KDnuggets Home » News » 2020 » Dec » Tutorials, Overviews » Simple & Intuitive Ensemble Learning in R ( 20:n46 )

Simple & Intuitive Ensemble Learning in R


Read about metaEnsembleR, an R package for heterogeneous ensemble meta-learning (classification and regression) that is fully-automated.



By Ajay Arunachalam, Orebro University

I have always believed in democratizing AI and machine learning, and in spreading knowledge in a way that caters to a broad audience, so that more people can exploit the power of AI.

One such attempt is the development of an R package for fully automated meta-level ensemble learning (classification and regression). It significantly lowers the barrier for practitioners to apply heterogeneous ensemble learning techniques to their everyday predictive problems, even without expert knowledge.

Before we delve into the package details, let’s go over a few basic concepts.

 

Why Ensemble Learning?

 
Generally, predictions become unreliable when the input sample falls outside the training distribution, when the model is biased toward the training data distribution, when the data are noisy, and so on. Most remedies require changes to the network architecture, fine-tuning, balanced data, an increased model size, etc. Further, the choice of algorithm plays a vital role, and scalability and learning ability tend to decrease with more complex datasets. Combining multiple learners is an effective alternative that has been applied to many real-world problems. Ensemble learners combine a diverse collection of predictions from individual base models to produce a composite predictive model that is more accurate and robust than its components. With meta ensemble learning, one can reduce the generalization error to some extent, largely irrespective of the data distribution, the number of classes, the choice of algorithm, the number of models, or the complexity of the dataset. In summary, the resulting predictive models generalize better.

How can we build models in a more stable fashion while minimizing under-fitting and overfitting, both of which are critical to the overall outcome? One solution is ensemble meta-learning over a heterogeneous collection of base learners.
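To make the idea concrete, here is a minimal stacking sketch in base R. This is purely illustrative and is not the package’s internal implementation: two base learners are fit on a training split, and a meta-learner is then fit on their predictions for a held-out split.

```r
# Illustrative stacking sketch (not metaEnsembleR's implementation):
# two base learners feed a meta-learner via held-out predictions.
set.seed(1)
idx   <- sample(nrow(mtcars), 20)
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]

# Base learners: two linear models on different feature subsets
base1 <- lm(mpg ~ wt + hp,     data = train)
base2 <- lm(mpg ~ disp + qsec, data = train)

# Meta-learner: fit on the base models' held-out predictions
meta_features <- data.frame(
  p1 = predict(base1, valid),
  p2 = predict(base2, valid),
  y  = valid$mpg
)
meta <- lm(y ~ p1 + p2, data = meta_features)

# An ensemble prediction runs new data through both stages
# (here the validation split stands in for new data)
new_p <- data.frame(p1 = predict(base1, valid), p2 = predict(base2, valid))
ensemble_pred <- predict(meta, new_p)
```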

 

Common Ensemble Learning Techniques

 
The most popular ensemble techniques are shown in the figure below. Stacked generalization is a general method that uses a high-level model to combine lower-level models in order to achieve greater predictive accuracy. In the bagging method, independent base models are fit to bootstrap samples of the original dataset. The boosting method grows an ensemble iteratively, in a dependent fashion, adjusting the weight of each observation based on past predictions. There are several extensions of both bagging and boosting.
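Bagging, for instance, can be sketched in a few lines of base R (again, illustrative only): fit the same model to bootstrap resamples of the data and average the resulting predictions.

```r
# Illustrative bagging sketch: average predictions of models
# fit to bootstrap resamples of the original dataset.
set.seed(2)
n_models <- 25
preds <- replicate(n_models, {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]  # bootstrap sample
  fit  <- lm(mpg ~ wt + hp, data = boot)                  # base model
  predict(fit, mtcars)
})
bagged <- rowMeans(preds)  # ensemble prediction = mean over models
```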

[Figure: common ensemble learning techniques — stacking, bagging, and boosting]

 

Overview

 
metaEnsembleR is an R package for automated meta-learning (classification and regression). The functionality it provides includes simple, user-input-based predictive modeling with a free choice of algorithms, a train-validation-test split, model evaluation, and guided prediction on unseen data, all of which help users build stacked ensembles on the go. The core aim of the package is to serve a broad audience: metaEnsembleR significantly lowers the barrier for practitioners to apply heterogeneous ensemble learning techniques to their everyday predictive problems, even without expert knowledge.

Using metaEnsembleR

The package consists of the following components:

  • Ensemble classifier training and prediction
  • Ensemble regressor training and prediction
  • Model evaluation, model results (observation vs. prediction on the test data), prediction on new unseen data, and writing performance charts and prediction results to disk

All these functions are intuitive; their use is illustrated below with examples covering both classification and regression.

 

Getting Started

 
The package can be installed directly from CRAN.

Install from Rconsole:

install.packages("metaEnsembleR")


However, the latest stable version (if any) can be found on GitHub and installed using the devtools package.

Install from GitHub:

if(!require(devtools)) install.packages("devtools")
devtools::install_github(repo = 'ajayarunachalam/metaEnsembleR', ref = 'main')


 

Usage

 

library(“metaEnsembleR”)
set.seed(111)


Training an ensemble classification model is as simple as a one-line call to the ensembler.classifier function. You can pass either a CSV file directly or an imported data frame. The function takes its arguments in the following order: the dataset, the index of the outcome/response variable, the base learners, the final learner, the train-validation-test split ratios, and the unseen data.

ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, read.csv('./unseen_data.csv'))


OR

unseen_new_data_testing <- iris[130:150,]
ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing)


The function returns the following: the test data with predictions, the prediction labels, the model result, and finally the predictions on the unseen data.

testpreddata <- data.frame(ensembler_return[1])
table(testpreddata$actual_label)
table(ensembler_return[2])
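From this data frame you can also compute summary metrics such as a confusion matrix and accuracy. The sketch below assumes the columns `actual_label` and `predictions` (the same names used in the plotting examples later in this article) and runs on a small mock data frame for illustration:

```r
# Hedged sketch: test-set accuracy from the returned data frame.
# Assumed column names: actual_label, predictions (shown on mock data).
testpreddata <- data.frame(
  actual_label = factor(c("setosa", "setosa", "versicolor", "virginica")),
  predictions  = factor(c("setosa", "versicolor", "versicolor", "virginica"))
)
conf_mat <- table(testpreddata$actual_label, testpreddata$predictions)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)  # correct / total
accuracy
```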


Performance comparison

modelresult <- ensembler_return[3]
modelresult


Unseen data

unseenpreddata <- data.frame(ensembler_return[4])
table(unseenpreddata$unseenpreddata)


 
Training an ensemble regression model is likewise a one-line call, to the ensembler.regression function. You can pass either a CSV file directly or an imported data frame. The function takes its arguments in the same order: the dataset, the index of the outcome/response variable, the base learners, the final learner, the train-validation-test split ratios, and the unseen data.

house_price <- read.csv(file = './data/regression/house_price_data.csv')
unseen_new_data_testing_house_price <- house_price[250:414,]
write.csv(unseen_new_data_testing_house_price, 'unseen_house_price_regression.csv', fileEncoding = 'UTF-8', row.names = F)
ensembler_return <- ensembler.regression(house_price[1:250,], 1, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, read.csv('./unseen_house_price_regression.csv'))


OR

ensembler_return <- ensembler.regression(house_price[1:250,], 1, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing_house_price)


The function returns the following: the test data with predictions, the prediction values, the model result, and finally the unseen data with its predictions.

testpreddata <- data.frame(ensembler_return[1])
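For regression, you would typically summarize the test predictions with error metrics such as RMSE and MAE. The column names below are assumptions for illustration only; inspect `colnames(testpreddata)` to find the actual and predicted columns in the returned data frame.

```r
# Hedged sketch: RMSE and MAE on the regression test predictions.
# Column names `actual` and `predicted` are illustrative assumptions.
testpreddata <- data.frame(actual = c(10, 12, 9), predicted = c(11, 12, 8))
rmse <- sqrt(mean((testpreddata$actual - testpreddata$predicted)^2))
mae  <- mean(abs(testpreddata$actual - testpreddata$predicted))
```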


Performance comparison

modelresult <- ensembler_return[3]
modelresult
write.csv(modelresult[[1]], "performance_chart.csv")


Unseen data

unseenpreddata <- data.frame(ensembler_return[4])


 

Examples

 

Classification

 

library("metaEnsembleR")
data("iris")
unseen_new_data_testing <- iris[130:150,]
write.csv(unseen_new_data_testing, 'unseen_check.csv', fileEncoding = 'UTF-8', row.names = F)
ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing)
testpreddata <- data.frame(ensembler_return[1])
table(testpreddata$actual_label)
table(ensembler_return[2])


Performance comparison

library(ggplot2)    # for qplot() and ggsave()
library(gridExtra)  # for tableGrob() and grid.arrange()
modelresult <- ensembler_return[3]
modelresult
act_mybar <- qplot(testpreddata$actual_label, geom = "bar")
act_mybar
pred_mybar <- qplot(testpreddata$predictions, geom = "bar")
pred_mybar
act_tbl <- tableGrob(t(summary(testpreddata$actual_label)))
pred_tbl <- tableGrob(t(summary(testpreddata$predictions)))
ggsave("testdata_actual_vs_predicted_chart.pdf", grid.arrange(act_tbl, pred_tbl))
ggsave("testdata_actual_vs_predicted_plot.pdf", grid.arrange(act_mybar, pred_mybar))


Unseen data

unseenpreddata <- data.frame(ensembler_return[4])
table(unseenpreddata$unseenpreddata)
table(unseen_new_data_testing$Species)


 

Regression

 

library("metaEnsembleR")
data("rock")
unseen_rock_data <- rock[30:48,]
ensembler_return <- ensembler.regression(rock[1:30,], 4, c('lm'), 'rf', 0.40, 0.30, 0.30, unseen_rock_data)
testpreddata <- data.frame(ensembler_return[1])


Performance comparison

modelresult <- ensembler_return[3]
modelresult
write.csv(modelresult[[1]], "performance_chart.csv")


Unseen data

unseenpreddata <- data.frame(ensembler_return[4])


 

More Examples

 
Comprehensive demonstrations can be found in the Demo.R file; to reproduce the results, run Rscript Demo.R from the terminal.

If there is an implementation you would like to see here, or you would like to add examples, feel free to contribute. You can always reach me at ajay.aruanchalam08@gmail.com.

Always Keep Learning & Sharing Knowledge!!!

 
Bio: Ajay Arunachalam (personal website) is a Postdoctoral Researcher (Artificial Intelligence) at the Centre for Applied Autonomous Sensor Systems, Orebro University, Sweden. Prior to this, he worked as a Data Scientist at True Corporation, a communications conglomerate, working with petabytes of data and building and deploying deep models in production. He truly believes that tackling opacity in AI systems is the need of the hour before we can fully accept the power of AI. With this in mind, he has always strived to democratize AI and is inclined toward building interpretable models. His interests lie in applied artificial intelligence, machine learning, deep learning, deep RL, and natural language processing, specifically learning good representations. From his experience working on real-world problems, he fully acknowledges that finding good representations is key to designing systems that can solve interesting, challenging real-world problems, go beyond human-level intelligence, and ultimately explain complicated data that we don't understand. To achieve this, he envisions learning algorithms that can learn feature representations from both unlabelled and labelled data, be guided with and/or without human interaction, and operate at different levels of abstraction in order to bridge the gap between low-level data and high-level abstract concepts.

Original. Reposted with permission.
