Tuning Random Forest Hyperparameters

Hyperparameters are set before the learning process begins and sit outside of the model itself, and tuning them can significantly improve the overall performance of a machine learning model.




If you don’t already know, let’s quickly go over Random Forest. 

Random Forest is an Ensemble Learning method for classification, regression, and other tasks that works by building multiple Decision Trees. 

Ensemble Learning, in its simplest form, combines many classifiers to improve overall performance. Decision Trees are a non-parametric supervised learning method whose end goal is to build a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. 
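
To make that concrete, here is a minimal sketch of fitting a single Decision Tree with scikit-learn; the Iris dataset is used purely as placeholder data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # placeholder data

# The tree learns simple decision rules from the data features.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

print(tree.predict(X[:5]))  # predictions for the first five samples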

 

sklearn.ensemble.RandomForestClassifier


A Random Forest is made up of many Decision Trees. A multitude of trees builds a forest, which I guess is why it's called Random Forest. 

Bagging is the method that creates the 'forest' in Random Forests. Its aim is to reduce the variance of models that overfit the training data. Boosting is the opposite of Bagging: it aims to increase the complexity of models that suffer from high bias, resolving underfitting.
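
As a toy illustration of the bootstrap sampling behind Bagging, each tree sees a random sample of the training rows drawn with replacement (this is a sketch of the idea, not how scikit-learn implements it internally):

import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
indices = np.arange(n_samples)

for tree_id in range(3):
    # Sample with replacement: some rows repeat, some are left out.
    bootstrap = rng.choice(indices, size=n_samples, replace=True)
    print(f"tree {tree_id} trains on rows: {sorted(bootstrap)}")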

The Random Forest's outcome is based on the predictions generated by its Decision Trees: the outputs of the individual trees are averaged (for regression) or combined by majority vote (for classification). Increasing the number of trees makes the outcome more stable and accurate and reduces overfitting. 
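
A minimal sketch of this aggregation with scikit-learn, again using Iris as placeholder data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # placeholder data

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# predict_proba averages the probability estimates of the individual
# trees; predict then picks the most likely class from that average.
print(forest.predict_proba(X[:2]))
print(forest.predict(X[:2]))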

 

The importance of Hyperparameter Tuning

 

Hyperparameters are set before the learning process begins and live outside of the model itself, yet tuning them is important for almost any algorithm because it improves the overall performance of a machine learning model. If the hyperparameters are poorly chosen, the model will produce errors and inaccurate results because the loss function is not properly minimized.

Hyperparameter tuning is about finding the set of optimal hyperparameter values that maximizes the model's performance and minimizes the loss, producing better outputs. 

 

Hyperparameters of a Random Forest

 

Below is a list of the most important parameters, followed by a closer look at how to improve your model's prediction power and make the training phase easier. A short sketch after the list shows where each parameter lives in the scikit-learn API. 

max_depth: The maximum depth of the tree, meaning the longest path between the root node and a leaf node.

min_samples_split: The minimum number of samples required to split an internal node (default = 2).

max_leaf_nodes: The maximum number of leaf nodes a decision tree can have.

min_samples_leaf: The minimum number of samples required to be at a leaf node (default = 1).

n_estimators: The number of trees in the forest.

max_samples: The fraction of the original dataset given to any individual tree.

max_features: The number of features to consider when looking for the best split.

bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree (default = True).

criterion: The function to measure the quality of a split.
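
As promised, here is a sketch showing where each of the parameters above lives; the values shown are the library's defaults (or close to them), for illustration rather than as recommendations:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    max_depth=None,         # grow each tree until its leaves are pure
    min_samples_split=2,    # minimum samples to split an internal node
    min_samples_leaf=1,     # minimum samples required at a leaf node
    max_leaf_nodes=None,    # no cap on leaf nodes by default
    max_features="sqrt",    # features considered at each split
    max_samples=None,       # with bootstrap=True, None means all rows
    bootstrap=True,         # sample rows with replacement per tree
    criterion="gini",       # function measuring split quality
)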

 

Hyperparameter Tuning to improve predictions

 

n_estimators : int, default=100

This is the number of trees in the forest. As mentioned before, increasing the number of trees tends to improve accuracy and reduce overfitting. However, more trees also make your model slower, so choose an n_estimators value your processor can handle; that keeps your model stable and performing well.
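
A quick way to see this trade-off for yourself is to sweep a few values with cross-validation; Iris stands in for your own data here:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # placeholder data

for n in (10, 50, 100, 300):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(forest, X, y, cv=5)
    print(f"n_estimators={n}: mean accuracy {scores.mean():.3f}")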

max_features : {"sqrt", "log2", None}, int or float, default="sqrt"

This caps the number of features an individual tree can consider when splitting a node. Many would assume that increasing max_features improves the overall performance of the model. However, this decreases the diversity of the individual trees and increases the time the model takes to produce outputs. Therefore, finding the optimal max_features value is important to your model's performance (the search sketch below tries out a few options).

min_samples_leaf : int or float, default=1

This is the minimum number of samples required to be at a leaf node, the end node of a decision tree. A smaller min_samples_leaf value makes the model more prone to picking up noise. Again, hyperparameter tuning is about finding the optimum, so trying out different leaf sizes is advised, as in the sketch below.
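
One common way to try out these values is a small grid search over max_features and min_samples_leaf together; this is a sketch with illustrative grids, not prescribed values:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)  # placeholder data

param_grid = {
    "max_features": ["sqrt", "log2", None],   # None = all features
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")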

 

Hyperparameter Tuning to improve model training phase

 

random_state : int, RandomState instance or None, default=None

This parameter controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features considered when looking for the best split at each node. Fixing it to an integer makes your results reproducible across runs. 

n_jobs : int, default=None

This parameter sets the number of jobs to run in parallel, which essentially tells the engine how many processors it can use: -1 means use all available processors, while 1 means it can use only one processor. 
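
Putting the two together, a reproducible, parallel training run might look like this sketch:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # placeholder data

# random_state pins the randomness so results are reproducible;
# n_jobs=-1 uses all available processors to train trees in parallel.
forest = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1)
forest.fit(X, y)

print(forest.score(X, y))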

 

Conclusion

 

This article has given you a breakdown of what Random Forest is, why hyperparameter tuning matters, the most important parameters, and how you can improve your prediction power as well as your model training phase. 

If you would like to know more about these parameters, have a look at the scikit-learn documentation for sklearn.ensemble.RandomForestClassifier.

 
 
Nisha Arya is a Data Scientist and Freelance Technical Writer. She is particularly interested in providing Data Science career advice or tutorials and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence can benefit the longevity of human life. A keen learner, she seeks to broaden her tech knowledge and writing skills, while helping guide others.