Choosing an Open Source Machine Learning Library: TensorFlow, Theano, Torch, scikit-learn, Caffe

Open source is at the heart of innovation and the rapid evolution of technology these days. Here we discuss how to choose open source machine learning tools for different use cases.

By AltexSoft.

From healthcare and security to marketing personalization, despite being at the early stages of development, machine learning has been changing the way we use technology to solve business challenges and everyday tasks. This potential has prompted companies to start looking at machine learning as a relevant opportunity rather than a distant, unattainable virtue.

We’ve already discussed machine-learning-as-a-service tools for your ML projects. Now let’s look at free and open source software that allows everyone to board the machine learning train without spending time and resources on infrastructure support.

Why Open Source Machine Learning?

The term open source software refers to a tool whose source code is freely available on the Internet. The code of proprietary (closed source) software is private and distributed under license. For a business that’s just starting its ML initiative, using open source tools can be a great way to practice data science for free before deciding on enterprise-level tools like Microsoft Azure or Amazon Machine Learning.

The benefits of using open source tools don’t stop at their availability. Generally, such projects have a vast community of data engineers and data scientists eager to share datasets and pre-trained models. For instance, instead of building image recognition from scratch, you can use classification models trained on the data from ImageNet, or build your own using this dataset. Open source ML tools also let you leverage transfer learning, which means solving a machine learning problem by applying knowledge gained from working on a problem in a related or even distant domain. So, you can transfer some capabilities from a model that has learned to recognize cars to a model aimed at recognizing trucks.
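To make the transfer learning idea more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the "pre-trained" weights, the tiny dataset, and all function names are hypothetical and invented for this example, and no real library or ImageNet model is involved. The frozen layer stands in for knowledge learned on a source task, and only a small new "head" is trained for the target task.

```python
import math

# Hypothetical "pre-trained" weights: they stand in for a layer learned on a
# source task (say, recognizing cars) and are never updated below.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Apply the frozen layer (linear map followed by ReLU)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def predict(head, x):
    """New task-specific head: logistic regression on the frozen features."""
    feats = extract_features(x)
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], feats))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(head, data):
    """Average cross-entropy of the head's predictions on the dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(head, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train_head(head, data, lr=0.1, epochs=200):
    """Gradient descent that only touches the head's parameters."""
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * len(head["weights"])
        grad_b = 0.0
        for x, y in data:
            feats = extract_features(x)
            err = predict(head, x) - y
            grad_b += err
            for i, f in enumerate(feats):
                grad_w[i] += err * f
        head["bias"] -= lr * grad_b / n
        head["weights"] = [w - lr * g / n for w, g in zip(head["weights"], grad_w)]
    return head


# Tiny made-up target-task dataset (1 = "truck", 0 = "not a truck").
target_data = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], 0), ([1.0, 3.0], 0)]

head = {"weights": [0.0, 0.0], "bias": 0.0}
loss_before = log_loss(head, target_data)
train_head(head, target_data)
loss_after = log_loss(head, target_data)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In practice the frozen part would be a large network shared by the community, which is why only a small amount of task-specific data and training is needed for the new head.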

Depending on the task you’re working with, pre-trained models and open datasets may not be as accurate as custom ones, but they will save a substantial amount of effort and time, and they don’t require you to gather datasets. According to Andrew Ng, former chief scientist at Baidu and professor at Stanford, the concept of reusing open source models and datasets will be the second biggest driver of the commercial ML success after supervised learning.

Comparing GitHub commits and contributors for different open source tools

Among many active and less popular open source tools, we’ve picked five to explore in depth to help you find the one to start you on the road to data science experimentation. Let’s begin.

TensorFlow: Profound and Favored Tool from Google

Originally built by Google for internal use, TensorFlow was released under an Apache 2.0 open source license in 2015. The library is still used by the corporation for a number of services, such as speech recognition, Photo Search, and automatic responses for Gmail’s Inbox. Google’s reputation and the library’s useful flowgraphs for constructing models have attracted a massive number of contributors to TensorFlow. This has resulted in public access to exhaustive documentation and tutorials that offer an easy entry point into the world of neural network applications.

TensorFlow is a great Python tool for both deep neural networks research and complex mathematical computations, and it can even support reinforcement learning. The uniqueness of TensorFlow also lies in dataflow graphs – structures that consist of nodes (mathematical operations) and edges (numerical arrays or tensors).

Dataflow graphs allow you to create a visual representation of data flow between operations and then execute calculations
Source: TensorFlow
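The dataflow graph idea can be sketched with a toy example in plain Python. This is not TensorFlow's API: the `Node` class and the `constant`, `add`, `multiply`, and `run` helpers are hypothetical names made up for this illustration. The sketch only shows the two-phase pattern described above, where nodes are operations, the values passed between them play the role of tensors, and nothing is computed until the graph is executed.

```python
class Node:
    """A graph node: an operation plus the upstream nodes that feed it."""

    def __init__(self, op, inputs=(), name=""):
        self.op = op
        self.inputs = tuple(inputs)
        self.name = name


def constant(value, name=""):
    # A node with no inputs that simply produces a fixed array.
    return Node(lambda: value, name=name)


def add(a, b, name="add"):
    # Element-wise addition of the arrays produced by nodes a and b.
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], (a, b), name)


def multiply(a, b, name="mul"):
    # Element-wise multiplication of the arrays produced by nodes a and b.
    return Node(lambda x, y: [p * q for p, q in zip(x, y)], (a, b), name)


def run(node, cache=None):
    """Evaluate a node by first evaluating everything it depends on."""
    cache = {} if cache is None else cache
    if node not in cache:
        args = [run(dep, cache) for dep in node.inputs]
        cache[node] = node.op(*args)
    return cache[node]


# Phase 1: build the graph. No arithmetic happens here.
a = constant([1.0, 2.0], "a")
b = constant([3.0, 4.0], "b")
total = add(a, b)
result = multiply(total, b)

# Phase 2: execute the graph. The arrays now flow along the edges.
output = run(result)
print(output)  # [12.0, 24.0]
```

Separating graph construction from execution is what lets a library inspect, visualize, or optimize the whole computation before running it.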

Datasets and models

The flexibility of TensorFlow comes from the possibility of using it both for research and for recurring machine learning tasks. For full control, you can use the low-level API called TensorFlow Core, which lets you define models yourself and train them on your own dataset. There are also public and official pre-trained models, along with higher-level APIs built on top of TensorFlow Core. Popular resources you can start with include MNIST, a traditional dataset for identifying handwritten digits in images, and Medicare Data, a dataset from Google used to predict charges for medical services, among others.

Audience and learning curve

For someone exploring machine learning for the first time, TensorFlow’s variety of functions may be a bit of a struggle. Some even argue that the library doesn’t flatten the machine learning curve but instead makes it steeper. TensorFlow is a low-level library that requires ample code writing and a good understanding of data science specifics before you can work with it successfully. Consequently, it may not be your first choice if your data science team is IT-centric: there are simpler alternatives, which we discuss below.

Use cases

Considering its complexity, the use cases for TensorFlow mostly include solutions by large companies with access to machine learning specialists. For example, the British online supermarket Ocado applied TensorFlow to prioritize emails coming to their contact center and improve demand forecasting. Also, the global insurance company Axa used the library to predict large-loss car incidents involving their clients.

Theano: Mature Library with Extended Possibilities

Theano is a low-level, Python-based library for scientific computing. It targets deep learning tasks that involve defining, optimizing, and evaluating mathematical expressions. While it has impressive computing performance, users complain about an inaccessible interface and unhelpful error messages. For these reasons, Theano is mainly applied in combination with more user-friendly wrappers, such as Keras, Lasagne, and Blocks, three high-level frameworks aimed at fast prototyping and model testing.
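The "define, optimize, evaluate" workflow can be illustrated with a toy symbolic expression system in plain Python. This is a sketch of the general technique only and does not use or imitate Theano's actual API: the `Var`, `Const`, `Add`, and `Mul` classes and the `optimize` and `evaluate` functions are hypothetical names invented for this example, and the rewrite rules are a tiny subset of what a real library would apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


def optimize(expr):
    """Simplify an expression: fold constants, drop x*1 and x+0, collapse x*0."""
    if not isinstance(expr, (Add, Mul)):
        return expr
    left, right = optimize(expr.left), optimize(expr.right)
    is_add = isinstance(expr, Add)
    if isinstance(left, Const) and isinstance(right, Const):
        folded = left.value + right.value if is_add else left.value * right.value
        return Const(folded)
    if is_add:
        if left == Const(0.0):
            return right
        if right == Const(0.0):
            return left
    else:
        if Const(0.0) in (left, right):
            return Const(0.0)
        if left == Const(1.0):
            return right
        if right == Const(1.0):
            return left
    return type(expr)(left, right)


def evaluate(expr, env):
    """Compute the numeric value of an expression given variable bindings."""
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, Const):
        return expr.value
    lhs, rhs = evaluate(expr.left, env), evaluate(expr.right, env)
    return lhs + rhs if isinstance(expr, Add) else lhs * rhs


# Define: x*1 + 2*3, built symbolically without computing anything.
x = Var("x")
expr = Add(Mul(x, Const(1.0)), Mul(Const(2.0), Const(3.0)))

# Optimize: the expression is rewritten to the cheaper form x + 6.
simplified = optimize(expr)

# Evaluate: plug in a concrete value for x.
value = evaluate(simplified, {"x": 4.0})
print(simplified, value)
```

Because the expression exists as data before any numbers are plugged in, a library working this way can simplify it once and then evaluate the cheaper form many times.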

Datasets and models

There are public models for Theano, but each framework used on top of it also has plenty of tutorials, pre-trained models, and datasets to choose from. Keras, for instance, stores available models and detailed usage tutorials in its documentation.

Audience and learning curve

If you use Lasagne or Keras as high-level wrappers on top of Theano, you’ll again have a multitude of tutorials, pre-trained models, and datasets at your fingertips. Moreover, Keras is considered one of the easiest libraries to start with at the early stages of deep learning exploration.

Since TensorFlow was designed to replace Theano, a big part of Theano’s fanbase has left. But it still has a lot of advantages that many data scientists find compelling enough to keep them with the older library. Theano’s simplicity and maturity are serious points to consider when making this choice.

Use cases

Considered an industry standard for deep learning research and development, Theano was originally designed to implement state-of-the-art deep learning algorithms. However, since you probably won’t use Theano directly, its range of uses expands when it serves as a foundation for other libraries: digit and image recognition, object localization, and even chatbots.