Open Source Tools for Machine Learning

Open source machine learning software makes it easier to implement machine learning solutions on single computers and at scale, and the diversity of packages provide more options for implementers.

Accord Framework/
Machine Learning Accord, a machine learning and signal processing framework for .Net, is an extension of a previous project in the same vein, A set of algorithms for vision processing are included; it operates on image streams (such as video) and can be used to implement such functions as the tracking of moving objects. Accord also includes libraries that provide a more conventional gamut of machine learning functions, from neural networks to decision-tree systems.

Cloudera Oryx
Yet another machine learning project designed for Hadoop, Oryx comes courtesy of the creators of the Cloudera Hadoop distribution. The name on the label isn’t the only detail that sets Oryx apart: Per Cloudera’s emphasis on analyzing live streaming data by way of the Spark project, Oryx is designed to allow machine learning models to be deployed on real-time streamed data, enabling projects like real-time spam filters or recommendation engines.

As the name implies, ConvNetJS provides neural network machine learning libraries for use in JavaScript, facilitating use of the browser as a data workbench. An NPM version is also available for those using Node.js.

By now most everyone knows how GPUs can crunch certain problems faster than CPUs. But applications don’t automatically take advantage of GPU acceleration; they have to be specifically written to do so. CUDA-Convnet is a machine learning library for neural-network applications, written in C++ to exploit the Nvidia’s CUDA GPU processing technology (CUDA boards of at least the Fermi generation are required).

Google’s Go language has been in the wild for only five years, but has started to enjoy wider use, due to a growing collection of libraries. GoLearn was created to address the lack of an all-in-one machine learning library for Go; the goal is “simplicity paired with customizability,” according to developer Stephen Witworth.

0xdata’s H2O's algorithms are geared for business processes -- fraud or trend predictions, for instance -- rather than, say, image analysis. H2O can interact in a stand-alone fashion with HDFS stores, on top of YARN, in MapReduce, or directly in an Amazon EC2 instance.

Mahout Apache
The Mahout framework has long been tied to Hadoop, but many of the algorithms under its umbrella can also run as-is outside Hadoop. They're useful for stand-alone applications that might eventually be migrated into Hadoop or for Hadoop projects that could be spun off into their own stand-alone applications.

Apache’s own machine learning library for Spark and Hadoop, MLlib boasts a gamut of common algorithms and useful data types, designed to run at speed and scale. As you’d expect with any Hadoop project, Java is the primary language for working in MLlib, but Python users can connect MLlib with the NumPy library (also used in scikit-learn), and Scala users can write code against MLlib.

Python has become a go-to programming language for math, science, and statistics due to its ease of adoption and the breadth of libraries available for nearly any application. Scikit-learn leverages this breadth by building on top of several existing Python packages -- NumPy, SciPy, and matplotlib -- for math and science work. The resulting libraries can be used either for interactive “workbench” applications or be embedded into other software and reused.

Among the oldest, most venerable of machine learning libraries, Shogun was created in 1999 and written in C++, but isn’t limited to working in C++. Thanks to the SWIG library, Shogun can be used transparently in such languages and environments: as Java, Python, C#, Ruby, R, Lua, Octave, and Matlab.

Weka, a product of the University of Waikato, New Zealand, collects a set of Java machine learning algorithms engineered specifically for data mining. This GNU GPLv3-licensed collection has a package system to extend its functionality, with both official and unofficial packages available.