Silver Blog, Sep 2017

5 Machine Learning Projects You Can No Longer Overlook – Episode VI

Deep learning, data preparation, data visualization, oh my! Check out the latest installment of '5 Machine Learning Projects You Can No Longer Overlook' for insight on... well, what machine learning projects you can no longer overlook.

It's time for yet another installment of 5 Machine Learning Projects You Can No Longer Overlook, the modest quest of bringing formidable, lesser-known machine learning projects to a few additional sets of eyes. Previous lists have included both general purpose and specialized machine learning and deep learning libraries, along with auxiliary support, data cleaning, and automation tools.

This time around we showcase 5 more machine learning-related projects which you may not yet have heard of, drawn from a number of different ecosystems and programming languages. You may find that, even if you have no requirement for any of these particular tools, inspecting their broad implementation details or their specific code may help in generating some ideas of your own. As in the previous iteration, there are no formal criteria for inclusion beyond projects that have caught my eye over time spent online, and that have Github repositories. Yes, it's subjective, but there is no quantitative approach to this task that would make any sense.

Without further ado, here they are: yet another 5 machine learning projects you should consider having a look at. They are in no particular order, but are numbered to appear as though they are, since numbering things makes me feel warm and fuzzy.

1. Vectorflow

Last month, Netflix open-sourced Vectorflow, its in-house deep learning library. Vectorflow is written in D, and is designed around sparsity, agility, and single-machine (as in, the opposite of distributed) processing. Vectorflow looks to be an interesting machine learning project for those in the D ecosystem.

The link above is to a blog post introducing the project, which I recommend you consult for further information. Here is a link to Vectorflow's Github page.


2. Optimus

Are you interested in simplifying your data cleaning in the Spark environment? Sure you are!

The link above is to a blog post introducing Optimus, a library for accomplishing just that. I asked the project lead, Favio Vázquez, for a one-sentence overview, and he was good enough to provide the following.

Optimus is a data cleansing and exploratory data analysis framework built over Apache Spark that implements several handy tools for data wrangling, plotting and preparation in a distributed fashion that will work on your laptop or your big cluster, besides it is amazingly easy to install, use and understand.


Read the introductory blog post to find out more. Check out this notebook for an example of how Optimus can be used. Click here to see the project's Github page.

3. deeplearn.js

deeplearn.js brings neural network training to the browser, and allows for the running of pre-trained models in inference mode. From the project website:

We provide two APIs, an immediate execution model (think NumPy) and a deferred execution model mirroring the TensorFlow API.

deeplearn.js was originally developed by the Google Brain PAIR team to build powerful interactive machine learning tools for the browser. You can use the library for everything from education, to model understanding, to art projects.
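If the distinction between immediate and deferred execution is new to you, here is a rough sketch of the idea in plain Python. To be clear, this is not the deeplearn.js API: the `Placeholder` and `Op` classes and the `add` and `scale` helpers are names I made up purely for illustration.

```python
# Illustrative only: plain Python, not the deeplearn.js API.

def add(a, b):
    """Element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]


def scale(a, k):
    """Multiply every element of a list by k."""
    return [x * k for x in a]


# Immediate execution: each call computes its result right away.
immediate_result = scale(add([1, 2], [3, 4]), 2)


# Deferred execution: first describe the computation as a graph,
# then run it later with concrete inputs.
class Placeholder:
    def __init__(self, name):
        self.name = name

    def run(self, feed):
        # Look up the concrete value supplied at run time.
        return feed[self.name]


class Op:
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self, feed):
        # Evaluate the input nodes first, then apply this node's function.
        return self.fn(*(node.run(feed) for node in self.inputs))


x = Placeholder("x")
y = Placeholder("y")
graph = Op(lambda s: scale(s, 2), Op(add, x, y))  # nothing is computed yet

deferred_result = graph.run({"x": [1, 2], "y": [3, 4]})
```

Both styles produce the same numbers. The difference is that the deferred graph can be built once and then run repeatedly with different inputs.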

Check out the deeplearn.js project on Github.

4. Facets

Facets is a machine learning dataset visualization library. Directly from the project's website:

The power of machine learning comes from its ability to learn patterns from large amounts of data. Understanding your data is critical to building a powerful machine learning system.

Facets contains two robust visualizations to aid in understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive.
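To give a feel for what "the shape of each feature" can mean in practice, here is a minimal sketch of per-feature summary statistics using only the Python standard library. It does not use Facets, and the `summarize_features` function and the toy records are my own hypothetical stand-ins; Facets itself presents this sort of information visually.

```python
from collections import Counter
from statistics import mean


def summarize_features(rows):
    """Return simple per-feature statistics for a list of dict records."""
    features = sorted({key for row in rows for key in row})
    summary = {}
    for name in features:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        stats = {"count": len(present), "missing": len(values) - len(present)}
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numeric feature: report range and mean.
            stats.update(min=min(present), max=max(present), mean=mean(present))
        else:
            # Categorical feature: report cardinality and the most common value.
            top = Counter(present).most_common(1)[0][0] if present else None
            stats.update(unique=len(set(present)), top=top)
        summary[name] = stats
    return summary


# Toy records, invented for this example.
rows = [
    {"age": 30, "city": "Lima"},
    {"age": 40, "city": "Oslo"},
    {"age": None, "city": "Lima"},
]
summary = summarize_features(rows)
```

Even a summary this small surfaces the things you want to know early, such as missing values and skewed categories.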

Check out Facets on Github.


5. skdata

One of the most common questions I get from people is "where do I get datasets?" It's a good question -- since we need data to perform any machine learning -- and data generally does not fall into our laps.

This project helps take care of this, at least for Python-based machine learning, and for practice. Tired of using iris for everything? skdata has got you covered.

skdata is a library of data sets for machine learning experiments, with modules that

  1. download data sets,
  2. load them as directly as possible as Python data structures, and
  3. provide protocols for machine learning tasks via convenient views.
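The three steps above can be sketched roughly as follows. This is not skdata's actual API: the `CsvDataset` class, its method names, and the example URL are all hypothetical, and the sketch pre-populates the cache with a tiny file so that it runs without a network connection.

```python
import csv
import os
import random
import tempfile
import urllib.request


class CsvDataset:
    """Hypothetical dataset wrapper for illustration; not part of skdata."""

    def __init__(self, url, cache_dir):
        self.url = url
        self.path = os.path.join(cache_dir, os.path.basename(url))

    def fetch(self):
        # Step 1: download the file once, then reuse the cached copy.
        if not os.path.exists(self.path):
            urllib.request.urlretrieve(self.url, self.path)

    def load(self):
        # Step 2: expose the data as plain Python structures (a list of dicts).
        with open(self.path, newline="") as handle:
            return list(csv.DictReader(handle))

    def train_test_view(self, test_fraction=0.25, seed=0):
        # Step 3: a task-oriented "view", here a reproducible train/test split.
        rows = self.load()
        random.Random(seed).shuffle(rows)
        n_test = int(len(rows) * test_fraction)
        return rows[n_test:], rows[:n_test]


cache_dir = tempfile.mkdtemp()
dataset = CsvDataset("https://example.com/toy.csv", cache_dir)  # placeholder URL

# Seed the cache with a tiny file so the example runs offline.
with open(dataset.path, "w", newline="") as handle:
    handle.write("x,label\n1,a\n2,b\n3,a\n4,b\n")

dataset.fetch()  # cached copy exists, so nothing is downloaded
train, test = dataset.train_test_view()
```

Keeping the download, the loading, and the task view separate means the same cached file can back several different experiments.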

See the list of datasets here. Check out the project's website here. Find skdata on Github here.