3 Reasons Why Data Scientists Should Use LightGBM

There are many great boosting Python libraries for data scientists to reap the benefits of. In this article, the author discusses LightGBM benefits and how they are specific to your data science job.

By Matthew Przybyla, KDnuggets on January 24, 2022 in Machine Learning

Introduction

There are many great boosting Python libraries for data scientists to reap the benefits of. Some include XGBoost and the newer CatBoost. However, there is one library that combines some characteristics of both, making it a must for data scientists. The benefits are of course great for learning and education, but more importantly, for working in a fast-paced professional environment that requires a fast algorithm. Below, I will discuss the benefits of LightGBM [1] and how they apply to your data science job.

Categorical Encoding

Perhaps the best feature of this library is the categorical feature support. Whereas a lot of data scientists might use one-hot encoding to create tons of new columns for only one categorical feature, this library allows you to specify the categorical features with the categorical_feature parameter.

While one-hot encoding is useful in academia, or inside your Jupyter Notebook, for example, it can be less useful in a professional setting. Say you have 10 categorical features with 100 unique values each: one-hot encoding expands them into 1,000 new columns. Not only does this make your dataframe sparse, but it can also make your model much slower. Another stressful consequence of this sparsity comes when you have to translate your features into production code for the software engineers working on your prediction service and deployment. This transfer of responsibilities (if you have that setup, of course) can be confusing and overwhelming for both parties.
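The column arithmetic in that example can be checked with a few lines of plain Python. This is only a sketch: the helper one_hot_row and the generated category names are hypothetical, not part of any library.

```python
def one_hot_row(row, categories_per_feature):
    """One-hot encode a single row, given the known categories of each feature."""
    encoded = []
    for value, categories in zip(row, categories_per_feature):
        # One new 0/1 column per possible category of this feature.
        encoded.extend(1 if value == category else 0 for category in categories)
    return encoded


# Hypothetical setup from the text: 10 categorical features, 100 unique values each.
n_features, n_categories = 10, 100
categories_per_feature = [
    [f"f{j}_c{k}" for k in range(n_categories)] for j in range(n_features)
]

row = [categories[0] for categories in categories_per_feature]
encoded = one_hot_row(row, categories_per_feature)

print(len(row), len(encoded), sum(encoded))  # 10 1000 10
```

Each encoded row has 1,000 columns but only 10 non-zero entries, so 99% of the values are zeros. That is the sparsity the paragraph above refers to.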

Here are some of the benefits of categorical encoding with LightGBM:

Easier to encode categorical features
Easier to use
Easier to work with other data scientists, software engineers, backend engineers, and product managers
Can retain original column names
Can reap the benefits of categorical features rather than traditional numeric conversion with one-hot encoding
These benefits can ultimately make your model faster and more accurate

Fast

Photo by Andy Beales on Unsplash [3].

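Not only does native handling of your categorical features make your model faster, but LightGBM also has a few other tricks to improve your training and prediction speeds. LightGBM uses both GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling), as well as histogram-based splitting. GOSS keeps the rows with large gradients and samples from the rest, EFB bundles sparse features that are rarely non-zero at the same time, and histogram-based splitting buckets continuous values into bins so that fewer split points need to be evaluated.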
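To make GOSS a little more concrete, here is a simplified sketch in plain Python. It keeps the rows with the largest absolute gradients, randomly samples from the remaining rows, and up-weights the sampled rows to compensate. The function goss_sample and the toy gradients are hypothetical and are not LightGBM's actual implementation; the names top_rate and other_rate simply mirror LightGBM's parameters of the same name.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Simplified GOSS step: return (row indices, row weights) to train on."""
    n = len(gradients)
    top_n = int(n * top_rate)
    rand_n = int(n * other_rate)

    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx = order[:top_n]  # always keep the rows with large gradients
    rest = order[top_n:]     # rows with small gradients

    # Randomly keep only a fraction of the small-gradient rows.
    rng = random.Random(seed)
    sampled = rng.sample(rest, rand_n)

    # Up-weight the sampled rows so they stand in for all small-gradient rows.
    amplify = (1.0 - top_rate) / other_rate
    indices = top_idx + sampled
    weights = [1.0] * top_n + [amplify] * rand_n
    return indices, weights


# Toy gradients, made up for illustration.
gradients = [((i * 37) % 101 - 50) / 50.0 for i in range(1000)]
indices, weights = goss_sample(gradients)
print(len(indices), "of", len(gradients), "rows used for this iteration")
```

With these settings only 30% of the rows are used when searching for splits, which is where the speedup comes from.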

Here is why a fast LightGBM model is useful for professionals:

Not every job will allow you weeks or months to come up with a model, and some may even want one the same week, or at least a proof-of-concept model
Faster modeling lets you test features and parameters more quickly, ultimately allowing you to work better in a fast-paced environment
You can test more features without slowing down your model as much as with other algorithms

It is simple, it is fast, and when you have a lot of people depending on your model, that speed will allow you to help the business more efficiently.

Accurate

Photo by Silvan Arnet on Unsplash [4].

XGBoost, CatBoost, and LightGBM are all accurate models. Yes, it ultimately depends on your problem, features, and data, but in general, these algorithms lead to accurate results after you have performed the necessary steps.

Because you can use categorical features directly, you are likely to have a more accurate model than with an approach that relies only on one-hot encoding. The way that LightGBM splits, growing trees leaf-wise rather than level-wise, can lead to more accurate models as well. It is important to note that you will want to prevent overfitting, though, since leaf-wise growth can overfit on smaller datasets.

Here are some of the reasons why LightGBM can be more accurate, and how it can help you professionally:

Splitting method
Categorical feature support
Of course, everyone wants a more accurate model, especially in a business (you just have to make sure you do not overfit)

Summary

Although these benefits are simple, they are incredibly important and make your work a lot easier. As a result, your company, both stakeholders and engineers, will be satisfied with you using LightGBM.

To summarize, here are some of the main benefits of using LightGBM professionally:

Categorical Encoding
Fast
Accurate

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with these benefits. Why or why not? What other benefits do you think are important to point out in LightGBM? These can certainly be clarified even further, but I hope I was able to shed some light on LightGBM.

Please feel free to check out my Medium profile as well.

References

[1] Microsoft Corporation, LightGBM documentation, (2022)
[2] Photo by Mikhail Vasilyev on Unsplash, (2017)
[3] Photo by Andy Beales on Unsplash, (2015)
[4] Photo by Silvan Arnet on Unsplash, (2020)

Matthew Przybyla (Medium) is a Senior Data Scientist at Favor Delivery based in Texas. He has a Master's degree in Data Science from Southern Methodist University. He enjoys writing about trending topics and tutorials in the data science space, ranging from new algorithms to advice on everyday work experiences for data scientists. Matt likes to highlight the business side of data science as opposed to only the technical side. Feel free to reach out to Matt on his LinkedIn.