Scikit Flow: Easy Deep Learning with TensorFlow and Scikit-learn

Scikit Flow is a new easy-to-use interface for TensorFlow from Google, based on the Scikit-learn fit/predict model. Does it succeed in making deep learning more accessible?



Google's TensorFlow has been publicly available since November, 2015, and there is no disputing that, in a few short months, it has made an impact on machine learning in general, and on deep learning specifically. There is evidence of widespread acceptance via blog posts, academic papers, and tutorials all over the web.

It is, of course, difficult to estimate true adoption rates, but TensorFlow's GitHub repository has nearly twice as many stars as either the next most-starred machine learning project, Scikit-learn, or the closest deep learning project, Berkeley Vision and Learning Center's Caffe. While this does not concretely show that TensorFlow has become the leader in the space, it is easy to surmise that, given its recent release, there has been considerable interest in, and use of, Google's deep learning library.

TensorFlow

For the most part, TensorFlow is relatively straightforward to use, and neural network aficionados without experience using the library could look at a given network's code and get an intuitive sense of what is going on. Syntax could likely be more to-the-point and concise, without the use of any wrappers, but there is a clear reason why it is not. Technically, TensorFlow is "an open source software library for numerical computation using data flow graphs," and while it is (predominantly) used for machine learning and deep learning research (and production), the system is general enough that it is applicable to a wide array of additional domains. If TensorFlow were any more deep learning-friendly, this specificity would detract from these potential additional uses.

# A simple Hello World! using TensorFlow
import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

# -> Hello, TensorFlow! (displayed as b'Hello, TensorFlow!' under Python 3)
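
To make the "data flow graph" idea a little more concrete, here is a toy sketch in plain Python of the define-then-run pattern that the Hello World above follows: building nodes performs no computation, and a separate run step evaluates them. The `Constant`, `Add`, `Multiply`, and `run` names are hypothetical and are not part of TensorFlow, whose real graphs are far more capable.

```python
# Toy illustration of a define-then-run data flow graph.
# All names here are hypothetical; this is not TensorFlow's implementation.

class Constant:
    """A graph node that holds a fixed value."""
    def __init__(self, value):
        self.value = value
        self.inputs = []

    def compute(self):
        return self.value


class Add:
    """A graph node that adds the results of two input nodes."""
    def __init__(self, left, right):
        self.inputs = [left, right]

    def compute(self, a, b):
        return a + b


class Multiply:
    """A graph node that multiplies the results of two input nodes."""
    def __init__(self, left, right):
        self.inputs = [left, right]

    def compute(self, a, b):
        return a * b


def run(node):
    """Evaluate a node by first evaluating the nodes it depends on."""
    input_values = [run(dep) for dep in node.inputs]
    return node.compute(*input_values)


# Building the graph only describes the computation (2 + 3) * 4 ...
graph = Multiply(Add(Constant(2), Constant(3)), Constant(4))

# ... and nothing is calculated until the graph is explicitly run.
result = run(graph)
print(result)  # -> 20
```

The separation between describing a computation and executing it is what lets a system like this stay general purpose, at the cost of some extra ceremony for simple tasks.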


Yet one of the reasons so many machine learning researchers and practitioners use Python, the language through which the TensorFlow library API is generally accessed, is its rapid prototyping ability. TensorFlow doesn't necessarily prohibit this quick turnaround time, but it does come with a learning curve, especially for anyone unfamiliar with similar libraries such as Theano.

But what if you could pick up TensorFlow and get to training neural networks almost immediately, with no concern for learning any additional syntax or configuration? That's where Scikit Flow comes in. However, I will first digress momentarily.

Scikit-learn + TensorFlow = Scikit Flow

Scikit-learn has a rich history as the de facto official Python general machine learning framework. While I'm sure that sentence will (and can) be disputed, and maybe it is a bit strong, there is no denying that Scikit-learn has a prominent place in the Python machine learning ecosystem, and in the discipline of machine learning in general.

And its ease of use and standardized interface have something to do with that. For example, Scikit-learn makes use of a simple fit/predict workflow model for its classification algorithms. This makes building, training, and testing models incredibly easy. The relevant code of a typical logistic regression model test/train might look like this:

from sklearn.linear_model import LogisticRegression
from sklearn import datasets, metrics

iris = datasets.load_iris()
classifier = LogisticRegression()
classifier.fit(iris.data, iris.target)
score = metrics.accuracy_score(iris.target, classifier.predict(iris.data))
print("Accuracy: %f" % score)


Want to try a Naive Bayes classifier? That doesn't require much of a change:

from sklearn.naive_bayes import GaussianNB
from sklearn import datasets, metrics

iris = datasets.load_iris()
classifier = GaussianNB()
classifier.fit(iris.data, iris.target)
score = metrics.accuracy_score(iris.target, classifier.predict(iris.data))
print("Accuracy: %f" % score)


The only changes were the import statement on the first line and the classifier instantiation statement. Given this, we can easily see the uniformity and conciseness of Scikit-learn's model interface. Even if you weren't aware of it before reading this, you already get it, since there is nothing to it. And while there is, of course, more to machine learning pipelines than the 7 lines of code in the above excerpts, those 7 lines cover a large and important aspect of it, and cover it the same way regardless of classifier.
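
Because every classifier exposes the same fit/predict methods, the two snippets above can be folded into a single loop. The sketch below also scores on a held-out split rather than on the training data, a step the shorter excerpts skip for brevity. It assumes Scikit-learn 0.18 or later, where `train_test_split` lives in `sklearn.model_selection`; the `test_size`, `max_iter`, and `random_state` values are arbitrary choices for illustration, not recommendations.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# Hold out 30% of the samples so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Any estimator with fit/predict can be dropped into this list.
classifiers = [LogisticRegression(max_iter=1000), GaussianNB()]

scores = {}
for classifier in classifiers:
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    name = type(classifier).__name__
    scores[name] = metrics.accuracy_score(y_test, predictions)
    print("%s accuracy: %f" % (name, scores[name]))
```

Swapping in another model means adding one entry to the list; the loop body does not change.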

And now back to Scikit Flow (skflow): Since (almost) everyone in the Python machine learning ecosystem has some knowledge of Scikit-learn, what if you could immediately harness the modelling power of TensorFlow by channelling the syntactical brevity of Scikit-learn? Scikit Flow (the very name alone alludes to this harnessing and channelling) is officially billed as follows:

This is a simplified interface for TensorFlow, to get people started on predictive analytics and data mining.

Practically, and more explicitly, Scikit Flow is a high level wrapper for the TensorFlow deep learning library, which allows the training and fitting of neural networks using the brief, familiar approach of Scikit-learn.

To answer the question, "Why Scikit Flow?", its repository README explains:

To smooth the transition from the Scikit Learn world of one-liner machine learning into the more open world of building different shapes of ML models. You can start by using fit/predict and slide into TensorFlow APIs as you are getting comfortable.

Importantly, Scikit Flow is an official TensorFlow project coming out of Google; it's not a hacked third party solution... not that there's anything wrong with that. At all. But the fact that Google has developed, released, and backed this project should give you the confidence you need that it will allow the 2 libraries to work in concert as promised. It's popular, too; at the time of writing, the Scikit Flow repo has nearly 1700 stars of its very own.

Discussion

We will now take a look at a few examples. If you want to play along at home, first ensure that you have the following installed:

  • Python: 2.7, 3.4+
  • Scikit-learn: 0.16, 0.17, 0.18+
  • TensorFlow: 0.6+

Scikit Flow is easily installed using pip with the following single line of code:

$ pip install git+git://github.com/tensorflow/skflow.git


To start out, we will first look at implementing a generic linear classifier in Scikit Flow.

import skflow
from sklearn import datasets, metrics

iris = datasets.load_iris()
classifier = skflow.TensorFlowLinearClassifier(n_classes=3)
classifier.fit(iris.data, iris.target)
score = metrics.accuracy_score(iris.target, classifier.predict(iris.data))
print("Accuracy: %f" % score)


As is evident, the above example follows the same fit/predict model as Scikit-learn. If you look at the earlier Scikit-learn models, you will notice their similarity to the above.

But that's only a linear classifier, not real deep learning. Deep neural networks are where we can see the real power of Scikit Flow. A generic neural network with 3 hidden layers of 10, 20, and 10 nodes can be easily coded as follows:

import skflow
from sklearn import datasets, metrics

iris = datasets.load_iris()
classifier = skflow.TensorFlowDNNClassifier(hidden_units=[10, 20, 10], n_classes=3)
classifier.fit(iris.data, iris.target)
score = metrics.accuracy_score(iris.target, classifier.predict(iris.data))
print("Accuracy: %f" % score)


Again, very little has changed. Instead of using the TensorFlowLinearClassifier from the immediately previous example, we have instead used the TensorFlowDNNClassifier, which has allowed us to build, train and test a deep neural classifier in 7 lines of (heavily-assisted) code. We have only explicitly specified the number of nodes and the number of hidden layers. Scikit Flow also has a stock recurrent neural network, some additional classifiers, and as an early work and one of the official TensorFlow projects, one could assume additional stock architectures and classifiers will soon be added.
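
To get a feel for what `hidden_units=[10, 20, 10]` describes, the short calculation below counts the trainable parameters such a network would have on the iris data (4 input features, 3 classes). It is only a sketch: it assumes plain fully connected layers with one bias per unit, which is a common convention but is an assumption here rather than something confirmed from the Scikit Flow source.

```python
# Layer widths for the iris network: 4 input features, three hidden layers
# of 10, 20, and 10 units, and 3 output classes.
# Assumes fully connected layers with one bias per unit.
n_features = 4
hidden_units = [10, 20, 10]
n_classes = 3

layer_sizes = [n_features] + hidden_units + [n_classes]

total_parameters = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight per input/output pair
    biases = n_out           # one bias per output unit
    total_parameters += weights + biases
    print("%2d -> %2d: %3d weights + %2d biases" % (n_in, n_out, weights, biases))

print("Total trainable parameters: %d" % total_parameters)  # -> 513
```

Under these assumptions, a single list argument stands in for roughly five hundred parameters that would otherwise have to be declared and wired up by hand.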

For an almost apples-to-apples comparison, check out the Scikit Flow and "raw" TensorFlow implementations of MNIST image classifiers. There are also many more examples in the GitHub repo (including an interesting one which interfaces with Dask, the parallel processing engine, for out-of-core data classification).

Scikit Flow also allows low-level TensorFlow code to be mixed in with its high-level interface. For those interested in creating architectures at a lower level and then training and testing them via a high level interface, Scikit Flow could conceivably be a good fit. It may also assist with the distributability of deep architectures; when sharing architectures created at a low level, providing the familiar Scikit-learn interface for others to train and test with may not be a bad idea, circumstance-dependent, of course.
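
One way to picture this pairing of low-level model code with a high-level interface is the plain-Python sketch below. A hand-written model exposes only a forward pass and a single update step, and a thin wrapper hides the training loop behind fit/predict. `Perceptron` and `FitPredictWrapper` are hypothetical names invented for this illustration; they are not Scikit Flow classes, and Scikit Flow itself builds TensorFlow graphs rather than anything this simple.

```python
# Sketch of wrapping a low-level model in a fit/predict interface.
# Perceptron and FitPredictWrapper are hypothetical names, not skflow classes.

class Perceptron:
    """Low-level model: exposes a raw forward pass and a single update step."""
    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def forward(self, x):
        """Return the predicted class (0 or 1) for one sample."""
        activation = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if activation > 0 else 0

    def train_step(self, x, target):
        """Apply the perceptron update rule for one sample."""
        error = target - self.forward(x)
        self.weights = [w + self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias += self.learning_rate * error


class FitPredictWrapper:
    """High-level interface: hides the training loop behind fit/predict."""
    def __init__(self, model, epochs=20):
        self.model = model
        self.epochs = epochs

    def fit(self, X, y):
        for _ in range(self.epochs):
            for x, target in zip(X, y):
                self.model.train_step(x, target)
        return self

    def predict(self, X):
        return [self.model.forward(x) for x in X]


# Tiny linearly separable data set: the logical AND of two inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

classifier = FitPredictWrapper(Perceptron(n_features=2))
classifier.fit(X, y)
predictions = classifier.predict(X)
print(predictions)  # -> [0, 0, 0, 1]
```

Anyone handed the wrapped object can train and test it without reading the low-level code, which is the sharing scenario described above.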

Conclusion

While skflow may not provide the flexibility of "raw" TensorFlow, the high level abstraction allows for the rapid prototyping of neural networks. It also allows newcomers to deep learning and TensorFlow to become productive almost immediately. Given that TensorFlow code can still be written alongside, there is an opportunity to mix code and provide even greater flexibility when required.

Given that Scikit Flow may also find a niche in other circumstances, such as model-sharing or managing the training and testing of lower-level networks, it seems as though Google has produced a well-conceived addition to TensorFlow, one that certainly will not hinder its further adoption.

Update: A new Reddit post from the developers of Scikit Flow is soliciting input on features to add. Have an idea? Drop a comment over there.

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.
