5 Great New Features in Scikit-learn 0.23

Check out 5 new features of the latest Scikit-learn release, including the ability to visualize estimators in notebooks, improvements to both k-means and gradient boosting, some new linear model implementations, and sample weight support for a pair of existing regressors.



Version 0.23 of scikit-learn, Python's workhorse machine learning library, has been released, and it includes a number of new features and bug fixes. You can find the release highlights on the official Scikit-learn website, and can find the exhaustive release notes here.

Updating your installation is done via pip:

   pip install --upgrade scikit-learn

or conda:

   conda update scikit-learn

With that out of the way, here are 5 features in Scikit-learn's latest release you should know about.


1. Visual Representation of Estimators in Notebooks

By using Scikit-learn's set_config() function, one can enable the global display='diagram' option in Jupyter notebooks. Once set, it provides a visual summary of the structure of the pipelines and composite estimators you have employed in your notebooks. The resultant diagrams are interactive, allowing sections such as pipelines, transformers and more to be expanded. See an example of an expanded diagram below (from Scikit-learn's website).
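Enabling the diagram display looks something like the following. Outside a notebook, the same HTML can be retrieved directly via estimator_html_repr() (a minimal sketch; the pipeline here is just an illustrative example):

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Enable the rich HTML diagram representation globally
set_config(display='diagram')

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])

# In a Jupyter notebook, simply displaying `pipe` now renders the
# interactive diagram; outside a notebook, grab the HTML directly:
html = estimator_html_repr(pipe)
```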




2. Improvements to K-Means

The Scikit-learn implementation of k-means has been revamped. It now purports to be faster and more stable. OpenMP parallelism has also now been adopted, and so the joblib-reliant n_jobs training parameter has gone the way of the dodo. Check out the Scikit-learn parallelism notes for more info on thread control.

Also, the Elkan algorithm now supports sparse matrices.


3. Improvements to Gradient Boosting

Both the HistGradientBoostingClassifier and HistGradientBoostingRegressor have received numerous improvements. Support for early stopping has been introduced, and it is enabled by default for datasets with more than 10,000 samples. Monotonic constraints are also now supported, allowing predictions to be constrained based on specific features, which you can read more about here. The addition of a simple monotonic constraint is shown below.

# monotonic_cst takes one entry per feature: 1 = increasing, -1 = decreasing, 0 = no constraint
gbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, y)


HistGradientBoostingRegressor has added support for a new poisson loss as well.

gbdt = HistGradientBoostingRegressor(loss='poisson', learning_rate=.01)
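Early stopping can also be forced on smaller datasets via the early_stopping parameter. A minimal runnable sketch on synthetic data (the experimental import was required in 0.23, as these estimators were still experimental then; it merely warns in later releases):

```python
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: required in 0.23
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=500, random_state=0)

# Force early stopping on a small dataset (it is only on by default
# above 10,000 samples); hold out 10% of the data for validation
gbdt = HistGradientBoostingRegressor(
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=5,
    max_iter=200,
).fit(X, y)

# n_iter_ reports how many boosting iterations actually ran
print(gbdt.n_iter_)
```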



4. New Generalized Linear Models

Three new regressors with non-normal loss functions have been added this time around:

- PoissonRegressor
- GammaRegressor
- TweedieRegressor

There isn't a whole lot more to say superficially about these, other than that they are implementations of a generalized linear regressor with these three different distributions. You can read more details in the generalized linear regression documentation.
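As a quick taste, the new GLMs follow the familiar estimator API. A minimal sketch fitting PoissonRegressor to synthetic count data (the data-generating process here is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic count-like, non-negative targets (e.g. number of events)
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = rng.poisson(lam=np.exp(X @ np.array([0.5, -0.2, 0.3])))

# The API mirrors the other linear models; alpha is the
# L2 regularization strength
glm = PoissonRegressor(alpha=1e-3, max_iter=300).fit(X, y)
preds = glm.predict(X)
```

Because the Poisson GLM predicts through an exponential link, its predictions are always non-negative, which is exactly what you want for count targets.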


5. Sample Weight Support for Existing Regressors

We get some new regressors this time around, as outlined above, but we also get support for sample weighting in a pair of existing regressors, namely Lasso and ElasticNet. It is easy to use: the regressor's fit() method accepts a sample_weight array with one entry per sample in the dataset (shown as randomly generated in the snippet below, adapted from Scikit-learn's website).

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X, y = make_regression(n_samples, n_features, random_state=rng)
sample_weight = rng.rand(n_samples)
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, random_state=rng)
reg = Lasso()
reg.fit(X_train, y_train, sample_weight=sw_train)
print(reg.score(X_test, y_test, sw_test))


For more information on all of the changes and updates in Scikit-learn 0.23, have a look at the full release notes.

Happy machine learning!