Why you need to improve your training data, and how to do it
This article examines the way you need to improve your training data and how it can be accomplished, including speech commands, choosing the right data, picking a model fast and more.
By Pete Warden.
Andrej Karpathy showed this slide as part of his talk at Train AI and I loved it! It captures the difference between deep learning research and production perfectly. Academic papers are almost entirely focused on new and improved models, with datasets usually chosen from a small set of public archives. Everyone I know who uses deep learning as part of an actual application spends most of their time worrying about the training data instead.
There are lots of good reasons why researchers are so fixated on model architectures, but it does mean that there are very few resources available to guide people who are focused on deploying machine learning in production. To address that, my talk at the conference was on “the unreasonable effectiveness of training data”, and I want to expand on that a bit in this blog post, explaining why data is so important along with some practical tips on improving it.
As part of my job I work closely with a lot of researchers and product teams, and my belief in the power of data improvements comes from the massive gains I’ve seen them achieve when they concentrate on that side of their model building. The biggest barrier to using deep learning in most applications is getting high enough accuracy in the real world, and improving the training set is the fastest route I’ve seen to accuracy improvements. Even if you’re blocked on other constraints like latency or storage size, increasing accuracy on a particular model lets you trade some of it off for those performance characteristics by using a smaller architecture.
I can’t share most of my observations of production systems, but I do have an open source example that demonstrates the same pattern. Last year I created a simple speech recognition example for TensorFlow, and it turned out that there was no existing dataset that I could easily use for training models. With the generous help of a lot of volunteers I collected 60,000 one-second audio clips of people speaking short words, thanks to the Open Speech Recording site the AIY team helped me launch. The resulting model was usable, but not as accurate as I’d like. To see how much of that was to do with my own limitations as a model designer, I ran a Kaggle competition using the same dataset. The competitors did much better than my naive models, but even with a lot of different approaches multiple teams came to within a fraction of a percent of 91% accuracy. To me this implied that there was something fundamentally wrong with the data, and indeed competitors uncovered a lot of errors like incorrect labels or truncated audio. This gave me the impetus to focus on a new release of the dataset with the problems they’d uncovered fixed, along with more samples.
I looked at the error metrics to understand what words the models were having the most problems with, and it turned out that the “Other” category (when speech was recognized, but the words weren’t within the model’s limited vocabulary) was particularly error-prone. To address that, I increased the number of different words that we were capturing, to provide more variety in training data.
Since the Kaggle contestants had reported labeling errors, I crowd-sourced an extra verification pass, asking people to listen to each clip and ensure that it matched the expected label. Because Kaggle had also uncovered some nearly silent or truncated files, I also wrote a utility to do some simple audio analysis and weed out particularly bad samples automatically. Finally, I increased the total number of utterances to over 100,000, despite removing bad files, thanks to the efforts of more volunteers and some paid crowd-sourcing.
To help others use the dataset (and learn from my mistakes!) I wrote everything relevant up in an Arxiv paper, along with updated accuracy results. The most important conclusion was that, without changing the model or test data at all, the top-one accuracy increased by over 4%, from 85.4% to 89.7%. This was a dramatic improvement, and was reflected in much higher satisfaction when people used the model in the Android or Raspberry Pi demo applications. I’m confident I would have achieved a much lower improvement if I’d spent the time on model adjustments, even though I’m currently using an architecture that I know is behind the state of the art.
This is the sort of process that I’ve seen produce great results again and again in production settings, but it can be hard to know where to start if you want to do the same thing. You can get some idea from the kind of techniques I used on the speech data, but to be more explicit, here are some approaches that I’ve found useful.
First, Look at Your Data
It may seem obvious, but your very first step should be to randomly browse through the training data you’re starting with. Copy some of the files onto your local machine, and spend a few hours previewing them. If you’re working with images, use something like MacOS’s finder to scroll through thumbnail views and you’ll be able to check out thousands very quickly. For audio, use the finder to play previews, or for text dump random snippets into your terminal. I didn’t spend enough time doing this for the first version speech commands which is why so many problems were uncovered by Kaggle contestants once they started working with the data.
I always feels a bit silly going through this process, but I’ve never regretted it afterwards. Every time I’ve done it, I’ve discovered something critically important about the data, whether it’s an unbalanced number of examples in different categories, corrupted data (for example PNGs labeled with JPG file extensions), incorrect labels, or just surprising combinations. Tom White has made some wonderful discoveries in ImageNet using inspection, including the “Sunglass” label actually referring to an archaic device for magnifying sunlight, glamor shots for “garbage truck”, and a bias towards undead women for “cloak”. Andrej’s work manually classifying photos from ImageNet taught me a lot about the dataset too, including how hard it is to tell all the different dog breeds apart, even for a person.
What action you’ll take depends on what you find, but you should always do this kind of inspection before you do any other data cleanup, since an intuitive knowledge of what’s in the set will help you make decisions on the rest of the steps.
Pick a Model Fast
Don’t spend very long choosing a model. If you’re doing image classification, check out AutoML, otherwise look at something like TensorFlow’s model repository or Fast.AI’s collection of examples to find a model that’s solving a similar problem to your product. The important thing is to begin iterating as quickly as possible, so you can try out your model with real users early and often. You’ll always be able to swap out an improved model down the road, and maybe see better results, but you have to get the data right first. Deep learning still obeys the fundamental computing law of “garbage in, garbage out”, so even the best model will be limited by flaws in your training set. By picking a model and testing it, you’ll be able to understand what those flaws are and start improving them.
To speed up your iteration speed even more, try to start with a model that’s been pre-trained on a large existing dataset and use transfer learning to fine-tune it with the (probably much smaller) set of data you’ve gathered. This usually gives much better results than training only on your smaller dataset, and is much faster, so you can quickly get a feel for how you need to adjust your data gathering strategy. The most important thing is that you are able to incorporate feedback from your results into your collection process, to adapt it as you learn, rather than running collection as a separate phase before training.
Next, we examine why you need to fake it before you make it.