5 Machine Learning Projects You Can No Longer Overlook, April

It's about that time again... 5 more machine learning or machine learning-related projects you may not yet have heard of, but may want to consider checking out. Find tools for data exploration, topic modeling, high-level APIs, and feature selection herein.

It's time for yet another installment of "5 Machine Learning Projects You Can No Longer Overlook" -- the modest quest of bringing formidable, lesser-known machine learning projects to a few additional sets of eyes -- this time for April 2017. Previous lists have included both general purpose and specialized machine learning and deep learning libraries, along with auxiliary support, data cleaning, and automation tools.

This time around we showcase 5 more machine learning-related projects which you may not yet have heard of, drawn from a number of different ecosystems and programming languages. You may find that, even if you have no requirement for any of these particular tools, inspecting their broad implementation details or their specific code may help in generating some ideas of your own. As in the previous iteration, there are no formal criteria for inclusion beyond the projects having caught my eye over time spent online and having GitHub repositories. Yes, it's subjective, but there is no quantitative approach to this task that would make any sense.

Without further ado, here they are: yet another 5 machine learning projects you should consider having a look at. They are in no order, but are numbered to appear as though they are, primarily to help calm the anxiety I have toward unnumbered lists.

1. Scikit-plot

Scikit-plot is the result of an unartistic data scientist's dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought.


I first came across Scikit-plot via a Reddit post by the author, and was using it almost immediately. The project aims to bring a series of standard, useful plots to Scikit-learn users without any fuss, since we're all interested in using them anyway. Examples of plots included in the library are:

  • Elbow plots
  • Feature importance graphs
  • PCA projection plots
  • ROC curves
  • Silhouette plots
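To make one of these plots concrete, here is a minimal sketch of the numbers behind a ROC curve, written in plain Python. It is not Scikit-plot's code or API; the function names `roc_curve_points` and `auc` and the toy labels and scores are assumptions made purely for illustration.

```python
def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sort by descending score so that each step lowers the decision threshold.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs (ties share one threshold).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Hypothetical labels and classifier scores, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(y_true, scores)
print(list(zip(fpr, tpr)))
print(auc(fpr, tpr))  # 0.75
```

A plotting library then only has to draw `tpr` against `fpr`; the appeal of a tool like Scikit-plot is that you do not have to write either step yourself.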

The library has two APIs. One, the Factory API, integrates tightly enough with Scikit-learn to control calls to its API. The other, the Functions API, is more orthodox in nature. Either will suffice, depending on your preference.

Find a quickstart guide here with everything you need to get going with this promising little library.

2. scikit-feature

scikit-feature is an open-source feature selection repository in Python developed by Data Mining and Machine Learning Lab at Arizona State University. It is built upon one widely used machine learning package scikit-learn and two scientific computing packages Numpy and Scipy. scikit-feature contains around 40 popular feature selection algorithms, including traditional feature selection algorithms and some structural and streaming feature selection algorithms.

Though all methods of feature selection share the common goal of identifying redundant and irrelevant features, there are numerous algorithms for approaching these related problems -- this is an active area of research. In that regard, scikit-feature is for both practical feature selection and feature selection algorithm research. The project has a website hosted by ASU, where it was conceived (originally in MATLAB, but later ported to the Python ecosystem). A list of algorithms which are supported by scikit-feature can be found here.
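To illustrate what "redundant" and "irrelevant" mean in practice, here is a minimal sketch of a greedy correlation filter in plain Python. It is not one of scikit-feature's algorithms, nor its API; the function names, the thresholds, and the toy data are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, min_relevance=0.3, max_redundancy=0.9):
    """Greedy filter: keep relevant features, skip ones redundant with those kept.

    `columns` maps a feature name to its list of values.
    Both thresholds are illustrative assumptions, not recommended defaults.
    """
    relevance = {name: abs(pearson(values, target))
                 for name, values in columns.items()}
    selected = []
    # Visit features from most to least relevant.
    for name in sorted(relevance, key=relevance.get, reverse=True):
        if relevance[name] < min_relevance:
            continue  # irrelevant: barely correlated with the target
        if any(abs(pearson(columns[name], columns[kept])) > max_redundancy
               for kept in selected):
            continue  # redundant: nearly duplicates a feature already kept
        selected.append(name)
    return selected


# Hypothetical toy data: two encodings of height, an age column, and noise.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same information as height_cm
    "age": [30, 25, 41, 38, 52],
    "noise": [3, 1, 2, 1, 3],
}
target = [50, 58, 72, 80, 95]  # e.g. weight in kg
selected = select_features(columns, target)
print(selected)
```

With this toy data, only one of the two height columns survives alongside `age`, and `noise` is dropped as irrelevant. Real feature selection algorithms are considerably more sophisticated, which is exactly why a curated repository of them is useful.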

As data scientist Rubens Zimbres so eloquently put it very recently (emphasis added):

After some experiences, using stacked neural nets, parallel neural nets, asymmetric configs, simple neural nets, multiple layers, dropouts, activation functions etc there is one conclusion: There's NOTHING like a good Feature Selection.

3. Smile

Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning system. With advanced data structures and algorithms, Smile delivers state-of-art performance.

Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.


Smile now seems to be the go-to general-purpose machine learning library for those working in the Java and Scala worlds -- a JVM Scikit-learn, if you will. The project has a very comprehensive tutorial website, covering not only Smile's operation, but also serving as a quality introduction to machine learning algorithms in general.

Smile is definitely worth a look if you are doing machine learning on the JVM. I would actually find it hard to believe that you are working in that ecosystem and are unaware of the project.

4. Gensim

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

Versatile, and aiming for completeness, Gensim implements "[e]fficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning."
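As a rough idea of what "document indexing and similarity retrieval" involves, here is a minimal plain-Python sketch that indexes a few documents as TF-IDF vectors and ranks them against a query by cosine similarity. It deliberately does not use Gensim, and it implements plain TF-IDF rather than any of the models named above; every function name and the tiny corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def to_tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen at indexing time are ignored."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(documents):
    """Return (tfidf_vectors, idf) for a list of raw text documents."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [to_tfidf(tokens, idf) for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, documents, vectors, idf):
    """Rank documents by cosine similarity to the query, best first."""
    query_vec = to_tfidf(tokenize(query), idf)
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(vectors, documents)]
    return sorted(scored, reverse=True)


# Hypothetical three-document corpus, for illustration only.
documents = [
    "human machine interface for computer applications",
    "graph of trees and minors",
    "user survey of computer system response time",
]
vectors, idf = build_index(documents)
ranked = most_similar("computer interface", documents, vectors, idf)
for score, doc in ranked:
    print(round(score, 3), doc)
```

Everything here lives in memory and compares the query against every document, which is fine for three sentences and hopeless for a large corpus. Scaling this kind of workflow, and swapping TF-IDF for richer models, is what a dedicated library is for.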

Gensim's documentation is here. A beginner's tutorial on topic modeling with Gensim, published on KDnuggets last year and written by Gensim's developers, can be found here.

5. Sonnet

From the blog post announcement of Sonnet's open-sourcing in early April:

Since its initial launch in November 2015, a diverse ecosystem of higher level libraries has sprung up around TensorFlow enabling common tasks to be accomplished quicker. Sonnet shares many similarities with some of these existing neural network libraries, but has some features specifically designed around our research requirements. The code release accompanying our Learning to learn paper included a preliminary version of Sonnet, and other forthcoming code releases will be built on top of the full library we are releasing today.


DeepMind has open-sourced its high-level TensorFlow library, which the organization admits is similar to other such libraries, but which implements features they desire, in particular:

Recurrent Neural Network states are often best represented as a collection of heterogeneous Tensors, and representing these as a flat list can be error prone. Sonnet provides utilities to deal with these arbitrary hierarchies[.]
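The idea in that quote can be sketched in a few lines of plain Python: flatten a nested state into a list of leaves, operate on the flat list, then pack the results back into the original shape. This is not Sonnet's code or API; `flatten` and `pack_like` are hypothetical names, floats stand in for Tensors, and only plain dicts, lists, and tuples are handled.

```python
def flatten(structure):
    """Return the leaves of a nested dict/list/tuple structure in a fixed order."""
    if isinstance(structure, dict):
        # Sort keys so that the leaf order is deterministic.
        return [leaf for key in sorted(structure) for leaf in flatten(structure[key])]
    if isinstance(structure, (list, tuple)):
        return [leaf for item in structure for leaf in flatten(item)]
    return [structure]


def pack_like(template, flat_leaves):
    """Rebuild a structure shaped like `template` from a flat list of leaves."""
    if len(flat_leaves) != len(flatten(template)):
        raise ValueError("number of leaves does not match the template")
    leaves = iter(flat_leaves)

    def build(node):
        if isinstance(node, dict):
            return {key: build(node[key]) for key in sorted(node)}
        if isinstance(node, (list, tuple)):
            # Plain lists and tuples only; namedtuples would need special handling.
            return type(node)([build(item) for item in node])
        return next(leaves)

    return build(template)


# Hypothetical recurrent state; the floats stand in for Tensors.
state = {"controller": (1.0, 2.0), "memory": [3.0, (4.0, 5.0)]}
flat = flatten(state)
print(flat)  # [1.0, 2.0, 3.0, 4.0, 5.0]

# Apply an update to every leaf, then restore the original hierarchy.
updated = pack_like(state, [value * 10 for value in flat])
print(updated)
```

The point of such utilities is that code operating on the state never has to remember which position in a flat list corresponds to which part of the model.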

Hopefully this has opened your eyes to some libraries you were unaware of, or some functionality you didn't realize you wanted for yourself.