5 Great New Features in Latest Scikit-learn Release
From not sweating missing values, to determining feature importance for any estimator, to support for stacking, and a new plotting API, here are 5 new features of the latest release of Scikit-learn which deserve your attention.
The latest release of Python's workhorse machine learning library includes a number of new features and bug fixes. You can find a full accounting of these changes in the official Scikit-learn 0.22 release highlights, and the complete list in the change log.
Update your installation via pip or conda:
pip install --upgrade scikit-learn
conda install scikit-learn
Here are 5 new features in the latest release of Scikit-learn which are worth your attention.
1. New Plotting API
A new plotting API is available, which lets you adjust the visuals of a plot without recomputing the underlying values. Supported plots include, among others, partial dependence plots, confusion matrices, and ROC curves. Here's a demonstration of the API, using an example from Scikit-learn's user guide:
Note the plotting is done via the single last line of code.
2. Stacked Generalization
The ensemble learning technique of stacking estimators for bias reduction has come to Scikit-learn.
StackingClassifier and StackingRegressor are the classes enabling estimator stacking, and the final_estimator uses these stacked estimator predictions as its input. See this example from the user guide, using the regression estimators defined below as estimators, with a gradient boosting regressor final estimator:
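A sketch along those lines, stacking three regressors under a gradient boosting final estimator. The diabetes dataset, the train/test split, and the hyperparameters are assumptions made here for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Small bundled regression dataset (illustrative choice).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Base estimators: their cross-validated predictions become the
# input features of the final estimator.
estimators = [
    ("ridge", RidgeCV()),
    ("lasso", LassoCV(random_state=42)),
    ("svr", SVR(C=1, gamma=1e-6)),
]

reg = StackingRegressor(
    estimators=estimators,
    final_estimator=GradientBoostingRegressor(random_state=42),
)
reg.fit(X_train, y_train)

r2 = reg.score(X_test, y_test)
print(f"R^2 on held-out data: {r2:.2f}")
```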
3. Feature Importance for Any Estimator
Permutation importance, available through permutation_importance in sklearn.inspection, works with any fitted estimator. It is calculated as follows. First, a baseline metric, defined by scoring, is evaluated on a (potentially different) dataset passed as X. Next, a feature column of that dataset is permuted and the metric is evaluated again. The permutation importance is defined as the difference between the baseline metric and the metric obtained after permuting the feature column.
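The steps above can be sketched by hand. This is an illustrative re-implementation, not the library's code: the helper name `manual_permutation_importance`, the synthetic data, and the logistic regression model are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=5, n_informative=2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Hypothetical helper: mean drop in score after shuffling each column."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # step 1: baseline metric
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # step 2: permute one feature column, leave the rest intact
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            # step 3: importance is baseline minus permuted score
            drops.append(baseline - model.score(X_perm, y))
        importances[j] = np.mean(drops)
    return importances


importances = manual_permutation_importance(model, X_val, y_val)
print(importances)
```

Averaging over several shuffles reduces the noise that a single random permutation would introduce.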
A full example from the release notes:
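A comparable minimal sketch of calling `permutation_importance` from `sklearn.inspection`. The synthetic dataset and the random forest are assumptions made for illustration and are not taken from the release notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Shuffle each feature n_repeats times on the validation set and
# record the resulting drop in the estimator's default score.
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)

# Report features from most to least important.
for j in result.importances_mean.argsort()[::-1]:
    print(
        f"feature {j}: "
        f"{result.importances_mean[j]:.3f} +/- {result.importances_std[j]:.3f}"
    )
```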
4. Gradient Boosting Missing Value Support
The histogram-based gradient boosting classifier and regressor (HistGradientBoostingClassifier and HistGradientBoostingRegressor) are now both natively equipped to deal with missing values, thus eliminating the need to manually impute. Here's how missing value decisions are made:
During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child accordingly. If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples.
The following example demonstrates:
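A minimal sketch of such an example, fitting the histogram-based classifier directly on data containing a NaN. The tiny dataset is an assumption chosen for illustration. In the 0.22 release these estimators were experimental and required an extra enabling import, which the fallback branch handles. Running it should print the predictions listed next.

```python
import numpy as np

try:
    # Recent releases expose the estimator directly.
    from sklearn.ensemble import HistGradientBoostingClassifier
except ImportError:
    # Releases where the estimator was still experimental (such as 0.22).
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier

# One feature, with a missing value in the last sample; no imputation step.
X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
preds = gbdt.predict(X)
print(preds)
```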
[0 0 1 1]
5. KNN Based Missing Value Imputation
While gradient boosting now natively handles missing values, explicit imputation can be performed on any dataset using the K-nearest neighbors imputer, KNNImputer. Each missing value is imputed using the mean of that feature over the n_neighbors nearest neighbors found in the training set, where two samples are considered near if the features that neither is missing are near. The default distance metric is a Euclidean distance that supports missing values.
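A short sketch of the imputer in use. The four-row dataset is an illustrative assumption, with `n_neighbors=2` so that each gap is filled from the two closest rows that have a value for that feature.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Two missing entries: row 0, column 2 and row 2, column 0.
X = [
    [1, 2, np.nan],
    [3, 4, 3],
    [np.nan, 6, 5],
    [8, 8, 7],
]

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Here the gap in the first row is filled with 4.0, the mean of 3 and 5 from its two nearest rows, and the gap in the third row with 5.5, the mean of 3 and 8.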
Happy machine learning!