Semi-supervised Feature Transfer: The Practical Benefit of Deep Learning Today?

This post evaluates four different strategies for solving a problem with machine learning, where customized models built from semi-supervised "deep" features using transfer learning outperform models built from scratch, and rival state-of-the-art methods.

By Daniel Kuster, indico.

Transfer learning

What is transfer learning?

Transfer learning is the concept of training a model to solve one problem, and then using the knowledge it learned to help solve other problems. In practice, we pre-train a deep neural network using large datasets (many millions of examples), so that it learns generally useful feature representations. We can then copy those internal feature representations into new models, effectively transferring in knowledge from the deep model.

What is semi-supervised feature transfer?

Do you remember learning to speak your first language? I don't, but as a father, it is fascinating to see my children do it! How is the process of learning a second language different from learning the first one? When children learn their first language, they’re simultaneously learning how to reason about things in the world, and also how to compose those ideas using language. Learning an additional language is easier because we already have lots of knowledge about things in the world---we need only relearn how to compose the new language.

It is similar with transfer learning. By pre-training a model on large datasets, we are teaching the model to learn useful features from the original data domain. But when we transfer that knowledge to a new task, even though these features may encode a lot of information about the world, the new model still has no way to know how that information relates to the new task or data domain. By providing a relatively small number of labeled examples about the new domain, it becomes possible to map the feature representations into the new task. So the "knowledge" in the new model is a combination of transferred features plus a few labeled examples. This is semi-supervised feature transfer.
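
To make the idea concrete, here is a minimal sketch in plain Python. It is not indico's implementation: the tiny `PRETRAINED_WORD_VECTORS` table is a hypothetical stand-in for features that a deep model would learn from millions of examples, and `transfer_features`, `train_classifier`, and the four labeled reviews are invented for illustration. The transferred features stay frozen; only a small logistic regression on top of them is trained.

```python
import math

# Hypothetical stand-in for features learned by a pre-trained deep model.
# In practice these would come from a network trained on a very large corpus.
PRETRAINED_WORD_VECTORS = {
    "great": [0.9, 0.1],
    "wonderful": [0.8, 0.2],
    "loved": [0.7, 0.1],
    "terrible": [0.1, 0.9],
    "boring": [0.2, 0.8],
    "hated": [0.1, 0.7],
}


def transfer_features(text):
    """Average the frozen word vectors of known words; unknown words are ignored."""
    vectors = [PRETRAINED_WORD_VECTORS[w] for w in text.lower().split()
               if w in PRETRAINED_WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict_proba(weights, bias, features):
    """Probability of the positive class under a logistic regression model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(examples, labels, learning_rate=0.5, epochs=500):
    """Fit logistic regression with stochastic gradient descent.

    Only these weights are learned; the transferred features are never updated.
    """
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# A few labeled examples from the new task (1 = positive, 0 = negative).
labeled = [("a great film", 1), ("terrible acting", 0),
           ("loved it", 1), ("hated it", 0)]
X = [transfer_features(text) for text, _ in labeled]
y = [label for _, label in labeled]
w, b = train_classifier(X, y)

# "wonderful" and "boring" never appear in the labeled examples, but their
# transferred features sit close to words that do.
p_wonderful = predict_proba(w, b, transfer_features("wonderful"))
p_boring = predict_proba(w, b, transfer_features("boring"))
print(p_wonderful > 0.5, p_boring < 0.5)
```

The point of the toy is the division of labor: the feature table carries the general knowledge, and the handful of labels only has to teach the classifier which directions in feature space mean "positive" for this task.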

Wait, how does deep learning make it work?

If, like many readers, you were aware of machine learning methods before the deep learning hype train rolled into the station, you might be wondering, "does this actually work?" The best answer is to simply try it for yourself on some familiar problem and data. Code is available, so you can easily compare models. In fact, indico provides 10k free API calls per month to encourage experimentation exactly like this.

So what is it about deep learning that makes features transferable? Depth, hierarchy, and joint training. Traditional machine learning features don't transfer well from one model to another because there are not enough constraints to enforce transferability during training. Also, these features tend to be manually engineered, rather than learned as part of the training/optimization.

Deep neural network architectures are built of layers upon layers, and therefore can learn to compose hierarchical features where the inputs to one layer are a function of the layers below it. This hierarchical structure of fine-grained features feeding into coarse-grained features is more likely to capture features that are invariant across data distributions, viewer perspective, language style, etc. When a deep neural network learns all of its features jointly during training, rather than relying on manually engineered features (as in traditional machine learning methods), it can produce transferable features.

To build indico's Custom Collections, we trained deep neural networks on very large and diverse data sets to produce features that are generally useful for image or text. So when you provide a few examples of labeled data to train a custom model on top of these features, you are informing the model about how those features can be combined to predict the specific outputs you have labeled.

Case study: sentiment analysis

For each strategy, we'll quickly summarize the steps to train and deploy a sentiment analysis model using movie reviews (for more details, remember to review the code itself). Then we'll show model performance metrics and compare the human effort, economics, and amount of training data required for each machine learning strategy.

Common steps:

We cover four different strategies for solving the sentiment prediction problem using machine learning. The overall workflow is the same for each strategy -- steps 1-4 and 8-9 are identical (regardless of model) and steps 5-7 depend on the model:

  1. Frame the problem as a machine learning task. Here we want to build a binary classification model for sentiment. Given a chunk of text, the model should predict "positive" or "negative".
  2. Choose metrics. To quantify classifier performance, we'll use accuracy, the Receiver Operating Characteristic (ROC) curve, and the area under the ROC curve (AUC).
  3. Get data. Download the Large Movie Review Dataset. It contains 50,000 movie reviews from IMDB, labeled with star ratings. The authors have already converted the star ratings into positive and negative labels.
  4. Prepare data to be ingested by your modeling pipeline. Normally we'd split the data into training, development, and validation sets. However, for these movie reviews, the authors have already created a standardized train/test split. We'll use the pre-defined splits for fair apples-to-apples comparison against published results. Alternatively, you could combine the data to make your own splits, use your favorite cross-validation method, etc. For API-based models, this includes sending training examples to the remote endpoint.
  5. Compute features (depends on model).
  6. Train a classifier (depends on features + model).
  7. Predict labels for each example in the test data. For API-based models, getting a prediction also includes sending a request to the remote API endpoint.
  8. Evaluate metrics over the predictions.
  9. Deploy the model so that your product or users can get predictions.
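
Steps 2 and 8 are the same for every strategy, so they can be written once and reused. The sketch below computes the three metrics in plain Python from true labels and predicted scores. The function names and the six toy scores are made up for illustration and are not results from any model in this post; a library such as scikit-learn provides equivalent, more thoroughly tested routines.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of examples whose thresholded score matches the true label."""
    predictions = [1 if score >= threshold else 0 for score in y_score]
    correct = sum(p == t for p, t in zip(predictions, y_true))
    return correct / len(y_true)


def roc_curve(y_true, y_score):
    """Return (fpr, tpr) lists, sweeping the decision threshold from high to low.

    Assumes binary labels (0 or 1) with at least one example of each class.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after the last example that shares this score,
        # so tied scores are handled as a single threshold.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            fpr.append(false_pos / n_neg)
            tpr.append(true_pos / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Toy labels and scores (1 = positive review, 0 = negative review).
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_curve(y_true, y_score)
print("accuracy:", accuracy(y_true, y_score))
print("AUC:", auc(fpr, tpr))
```

Accuracy depends on the chosen threshold (0.5 here), while the ROC curve and AUC summarize performance across all thresholds, which is why it is useful to report both.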