5 Machine Learning Projects You Should Not Overlook

It's about that time again... 5 more machine learning or machine learning-related projects you may not yet have heard of, but may want to consider checking out!

After a hiatus, the "Overlook..." posts are making their comeback this month, continuing the modest quest of bringing formidable, lesser-known machine learning projects to a few additional sets of eyes.

Check out the 5 projects below for some potential fresh machine learning ideas.

1. skift: scikit-learn wrappers for Python fastText

What is skift?

skift includes several scikit-learn-compatible wrappers for the fastText Python package, catering to the use cases described below.

What is fastText?

fastText is a library for efficient learning of word representations and sentence classification.

fastText works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.

Understanding fastText is the important piece of the puzzle. Once you have that understanding, skift makes it easy to use fastText and to integrate it with other scikit-learn functionality.

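>>> import pandas
>>> from skift import FirstColFtClassifier
>>> # Toy dataset: one text column and one label column
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> # Pass the text as a single-column DataFrame, as scikit-learn expects 2D input
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])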
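
To make the single-column use case concrete, here is a minimal pure-Python sketch of the wrapping idea. It is not skift's actual implementation: FirstColumnTextClassifier and KeywordVoteModel are hypothetical names invented for this illustration, and the toy keyword model merely stands in for fastText.

```python
from collections import Counter, defaultdict


class KeywordVoteModel:
    """Toy stand-in for a text-only model such as fastText (hypothetical)."""

    def fit(self, texts, labels):
        # Count how often each word appears under each label.
        self.word_votes_ = defaultdict(Counter)
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_votes_[word][label] += 1
        # Fall back to the most frequent label when no known word is present.
        self.default_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        predictions = []
        for text in texts:
            votes = Counter()
            for word in text.lower().split():
                votes.update(self.word_votes_.get(word, {}))
            predictions.append(
                votes.most_common(1)[0][0] if votes else self.default_
            )
        return predictions


class FirstColumnTextClassifier:
    """Hypothetical wrapper: feeds only column 0 of each row to a text model."""

    def __init__(self, text_model):
        self.text_model = text_model

    @staticmethod
    def _first_column(rows):
        # Every other column (numeric features, ids, ...) is ignored.
        return [row[0] for row in rows]

    def fit(self, rows, labels):
        self.text_model.fit(self._first_column(rows), list(labels))
        return self

    def predict(self, rows):
        return self.text_model.predict(self._first_column(rows))


# Rows mix a text column with non-text columns the wrapper never looks at.
rows = [["woof woof", 3.2, "a"], ["meow", 1.1, "b"], ["woof", 2.7, "a"]]
labels = [0, 1, 0]
clf = FirstColumnTextClassifier(KeywordVoteModel()).fit(rows, labels)
predictions = clf.predict([["meow meow", 9.9, "z"], ["woof", 0.0, "q"]])
```

Because the wrapper exposes the usual fit and predict methods on whole rows, it can sit next to classifiers that use the remaining columns, which is the stacking scenario described above.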


2. PHP-ML: Machine Learning library for PHP

Tired of not having decent machine learning alternatives for PHP? Are you a masochist (if you're using PHP this answers itself)? Well then, this project just may be for you!

Fresh approach to Machine Learning in PHP. Algorithms, Cross Validation, Neural Network, Preprocessing, Feature Extraction and much more in one library.

While I kid, I am far enough removed from the PHP world not to know whether this serves any particular pressing requirement; the 5K+ stars would suggest that it likely does! Beyond that, I'm always interested in seeing how machine learning ecosystems unfold in different programming language environments. Perhaps you are too, or more importantly you may actually have a use for what seems at first glance to be a solid library for the PHP people out there.


3. Keras Scikit-Learn API Wrappers

While this is not technically its own project, I find it important enough to highlight here.

You can use Sequential Keras models (single-input only) as part of your scikit-learn workflow via the wrappers found in the keras.wrappers.scikit_learn module (the file keras/wrappers/scikit_learn.py).

Just as understanding fastText was the key to skift (above), the important piece of this puzzle is understanding how to implement neural networks with Keras, itself a high-level API. What these wrappers accomplish is letting you integrate Keras with additional scikit-learn functionality through the familiar API and methods. You can find the wrappers in the official Keras GitHub repository.

If you are already using Keras, there is a good chance this is not new to you. If you aren't, knowing that this integration is possible may be enough to have you take a look.


4. CatBoost: machine learning method based on gradient boosting over decision trees

Gradient boosting continues to be all the rage. Or some of the rage, at the very least. A recent entrant into the gradient boosted trees arena is CatBoost.

Main advantages of CatBoost, according to the project:

  • Superior quality when compared with other GBDT libraries.
  • Best in class inference speed.
  • Support for both numerical and categorical features.
  • Fast GPU and multi-GPU (on one node) support for training.
  • Data visualization tools included.

CatBoost is available in Python, R, and command-line interface flavors. Check out the project's tutorials, and find much more in its full documentation.
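
For readers new to the technique, the sketch below shows gradient boosting over decision trees in its simplest form: one-split regression trees (stumps), each fitted to the residuals of a squared-error loss. It is a pure-Python illustration and not CatBoost's algorithm, which does far more (categorical feature handling among other things). All function names and the toy data are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimises squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(preds, xs)
        ]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    errors = [(predict_boosted(model, x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.1, 3.9, 8.0, 8.2, 7.9, 8.1]
small = fit_boosted_stumps(xs, ys, n_rounds=5)
large = fit_boosted_stumps(xs, ys, n_rounds=100)
```

Comparing mse(small, xs, ys) with mse(large, xs, ys) shows the training error shrinking as more stumps are added, which is the core idea that libraries such as CatBoost build on.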


5. PyMC3: Probabilistic Programming in Python

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning which focuses on advanced Markov chain Monte Carlo and variational fitting algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.

PyMC3 sits atop Theano, which provides:

  • Computation optimization and dynamic C compilation
  • Numpy broadcasting and advanced indexing
  • Linear algebra operators
  • Simple extensibility

...and more. To dig in, check out the project's getting started guide and its API quick start guide.
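
To give a feel for what Markov chain Monte Carlo does, here is a plain random-walk Metropolis sampler in pure Python. It is far simpler than the samplers PyMC3 provides and does not use PyMC3's API; the coin-flip data, the uniform prior, and all names are assumptions made for the illustration.

```python
import math
import random


def log_posterior(p, heads, tails):
    """Unnormalised log posterior of a coin's bias p under a uniform prior."""
    if not 0.0 < p < 1.0:
        return -math.inf  # zero probability outside (0, 1)
    return heads * math.log(p) + tails * math.log(1.0 - p)


def metropolis(heads, tails, n_samples=20000, step=0.2, seed=0):
    """Random-walk Metropolis sampler for the coin bias."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, heads, tails)
    samples = []
    for _ in range(n_samples):
        # Propose a symmetric random-walk move around the current value.
        proposal = current + rng.uniform(-step, step)
        proposal_lp = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


samples = metropolis(heads=7, tails=3)
kept = samples[2000:]  # discard early samples as burn-in
posterior_mean = sum(kept) / len(kept)
```

With 7 heads, 3 tails, and a uniform prior, the exact posterior is Beta(8, 4) with mean 2/3, so posterior_mean should land close to that value. A PyMC3 model expresses the same kind of problem declaratively and leaves the sampling to the library.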