Semi-supervised Feature Transfer: The Practical Benefit of Deep Learning Today?

This post evaluates four different strategies for solving a problem with machine learning, where customized models built from semi-supervised "deep" features using transfer learning outperform models built from scratch, and rival state-of-the-art methods.

Strategy D: Custom model using features from a deep neural network trained to predict sentiment.

When context matters, and "bag of words" models aren't good enough, we need features that understand how words are composed into various patterns of language. Using indico's Custom Collection API with the "sentiment" domain flag (see “Creating a Collection” in the indico docs), we can use feature vectors from a pre-trained recurrent neural network to train a new custom model using a small number of labeled examples.

ML features: fixed-length vector projected from recurrent weights of a deep neural network trained to predict sentiment

Economics & practical factors:

  • How much data required to train a good model?
    • Even a few dozen examples give surprisingly strong performance, which scales with the addition of more data.
  • How much effort to train?
    • Moderate. You’ll need to have some labeled training examples, but the complexity of model training and hyperparameter optimization is handled by the API. Training is identical to Strategy C, this time using a simple keyword argument to specify sentiment features.
  • How much effort to deploy?
    • Pretty minimal. The API has a well-documented REST endpoint and client libraries in most common programming languages (Python, Ruby, Node, Java, R, etc.).
  • Other sticking points?
    • Requires a network connection.



Comparing models by performance metrics

Performance results


Interactive figure: the curve changes as the model is trained on 25, 100, 500, 1000, 2000, or 4000 examples. All models were tested on 4000 examples.

It should be no surprise that Strategy A, the sklearn “built from scratch” model, struggles to do well when only a few examples are used for training. But we know that n-grams turned into TF-IDF features should capture many aspects of sentiment in English text, and indeed, as the training data grow into the thousands of examples, the model becomes viable. Ultimately, the sklearn model maxes out at around 90-91% accuracy, regardless of the amount of data available.
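The evaluation protocol behind these curves can be sketched in a few lines of Python. This is a minimal sketch, not the code used for this post: the `learning_curve` helper, the choice of nested random subsets, and the toy `fit_word_votes` classifier are all hypothetical stand-ins. In practice, `fit` would wrap whichever model is being compared.

```python
import random
from collections import Counter

def learning_curve(train, test, sizes, fit, seed=0):
    """Train on growing subsets of `train`; score every model on the same `test` set."""
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)  # assumption: nested random subsets of one shuffle
    results = {}
    for n in sizes:
        predict = fit(shuffled[:n])
        correct = sum(predict(text) == label for text, label in test)
        results[n] = correct / len(test)
    return results

def fit_word_votes(examples):
    """Toy stand-in model: each word votes +1 (pos) or -1 (neg)."""
    votes = Counter()
    for text, label in examples:
        for word in text.lower().split():
            votes[word] += 1 if label == "pos" else -1

    def predict(text):
        score = sum(votes[word] for word in text.lower().split())
        return "pos" if score >= 0 else "neg"

    return predict

# Made-up data, only to show the shape of the inputs and outputs.
train = [("great film", "pos"), ("awful film", "neg"),
         ("loved it", "pos"), ("hated it", "neg")]
test = [("great", "pos"), ("awful", "neg")]

curve = learning_curve(train, test, sizes=(2, 4), fit=fit_word_votes)
print(curve)  # maps each training size to accuracy on the fixed test set
```

Holding the test set fixed while only the training size changes is what makes the points on a curve comparable to each other.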

Strategy B, the pre-built “sentiment HQ” model, requires no training data. So as expected, it performs similarly for all tests, with any tiny differences entirely due to variations in which examples were randomly selected for the test. Performance is very strong, but this API is built specifically for sentiment analysis, so it won’t help you with other text problems.
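To put a rough number on those "tiny differences", the sampling noise of an accuracy estimate can be approximated with the binomial standard error. The sketch below assumes independent test examples and an illustrative accuracy of 93% (a value chosen for the example, not a result reported here); `accuracy_standard_error` is a hypothetical helper name.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Illustrative values: 93% accuracy (assumed) on a 4000-example test set.
se = accuracy_standard_error(0.93, 4000)
print(f"standard error: {se:.4f}")  # prints "standard error: 0.0040"
```

With 4000 test examples, that works out to roughly 0.4 percentage points, so run-to-run differences of a few tenths of a percent are within the noise you would expect from resampling the test set.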

The custom-trained API using general text features, Strategy C, performs better than the sklearn model when very little training data is available, but because these general text features only represent word-level information, the model has limited capacity to improve as training data are increased. We see similar performance for the sklearn model, which also uses word-level features. However, the sklearn model uses single words in addition to bigrams and trigrams, and these additional features allow it to improve beyond the word-level features of Strategy C.
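Here is a small illustration of why the extra n-grams matter: two reviews with opposite meanings can share exactly the same bag of words and differ only in their bigrams and trigrams. This is a toy sketch with made-up sentences and plain whitespace tokenization, not the feature pipeline used for either model; `ngrams` is a hypothetical helper.

```python
from collections import Counter

def ngrams(text, n):
    """Count the n-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Made-up examples with opposite sentiment but identical word counts.
review_a = "surprisingly good , not bad at all"
review_b = "surprisingly bad , not good at all"

same_unigrams = ngrams(review_a, 1) == ngrams(review_b, 1)
same_bigrams = ngrams(review_a, 2) == ngrams(review_b, 2)

print(same_unigrams)  # True: single words cannot tell the reviews apart
print(same_bigrams)   # False: "not bad" and "not good" separate them
```

A model limited to single-word features sees these two reviews as identical, while one with bigrams has a feature that separates them.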

Strategy D, the custom-trained API using sentiment features from a deep neural network, performs the best of all trained models, matching the performance of the pre-built “sentiment HQ” model when some labeled training examples are available.

To move beyond word-level performance we need features that understand context. Both the pre-built Sentiment HQ API and the custom model trained using sentiment features are significant improvements beyond the word-level performance. In fact, both models achieve state-of-the-art performance in terms of accuracy, beating the previous best published accuracy of 92.76% on this dataset from Andrew Dai and Quoc Le in November 2015.

It is also worth noting that the Sentiment HQ model (Strategy B) was trained specifically for sentiment prediction using a different data distribution: online reviews that did not include IMDB or the Large Movie Review Dataset (remember, this is the dataset we used to train the other models). Thus, the Custom Collection has the somewhat easier challenge of predicting on new IMDB reviews from other IMDB reviews. This is a classic illustration of how small differences in evaluation metrics are often less significant than understanding similarities in data distributions and task alignment among various models. Task and data alignment are key challenges in machine learning, and they become especially important when transferring pre-trained features into a new model.

About: indico creates powerful machine learning APIs for analyzing images and text. Using our latest product (Strategies C and D above) anyone can train and deploy custom models using rich feature embeddings from deep neural networks with just a few lines of code. Follow us on Twitter: @djkust, @indicoData. Visit indico’s blog for great articles on machine learning, deep learning, and more: