4 Tips for Dataset Curation for NLP Projects

You have heard it before, and you will hear it again: it's all about the data. Curating the right data matters far more than just curating any data. When dealing with text data, many hard-earned lessons have been learned by others over the years, and here are four data curation tips that you should be sure to follow during your next NLP project.

By Paul Barba, Chief Scientist, Lexalytics.


After many years of painfully learned lessons from managing and implementing AI and ML projects, I’ve come to believe that the single most crucial piece of the puzzle is choosing the right dataset for the problem at hand, especially when it comes to a text or NLP problem.

While the algorithm used and its parameters are essential, these aspects are relatively easy to change when necessary -- and the drop in cost for machine time in recent years means that fixes are increasingly less expensive to make.

However, human time is inherently expensive, and when the data isn’t useful, it's a shameful waste of resources. With that in mind, I’ve outlined four tips, with examples and anecdotes, on how to curate the ultimate dataset for NLP and text analysis.


Tip #1:


The most important lesson is to start small, as small as you can. Try to train a model as soon as possible, even with as few as 50 to 100 examples, or even just a couple when using zero-shot approaches.

If you want to have a lot of classes in your NER models, start with one category, get it marked up, and see how it’s doing, because the earlier you catch any issue, the easier it is to fix.
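As a rough illustration of how little is needed to get a first signal, the sketch below trains a tiny Naive Bayes text classifier in plain Python and scores it on a couple of held-out examples. The toy data, the labels, and the helper names (tokenize, train_naive_bayes, predict) are all assumptions made for this example, not part of any particular framework; a real first pass would swap in your own 50 to 100 hand-labeled examples.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; deliberately crude for a first pass."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability under the model."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing: a word seen only in other classes
            # should not drive this class's probability to zero.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data standing in for a first small batch of labels.
train = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the attached report", "ham"),
]
held_out = [
    ("claim your free prize", "spam"),
    ("agenda for the review meeting", "ham"),
]

model = train_naive_bayes(train)
correct = sum(predict(model, text) == label for text, label in held_out)
accuracy = correct / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

A model this small is not meant to ship. The point is that it runs in seconds on a laptop, so labeling mistakes or an ill-defined category show up before any serious compute is spent.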

One anecdote related to this tip: Early in 2014, like a lot of companies around that time, we built a machine learning framework, and our marketing department thought it would be a good idea to use it to try to address our ever-burgeoning spam issue.

Suddenly, dozens of expensive machines were running on Amazon trying to clean up common crawl, and the model wasn’t working. We’d burned through a lot of cash, and the people in the company holding the purse strings weren’t very pleased. In hindsight, this was a project where there were dozens of opportunities to go small and explainable and catch any early issues with the model without needing a machine to do all the work.


Tip #2:


Use datasets that are representative of the real world. When just getting started, data scientists can use whatever is available. SuperGLUE is one example; for sentiment, the IMDB dataset is fairly standard; for entities, there's CoNLL ‘03.

However, when solving a specific business domain problem and looking to push the state-of-the-art forward, it’s important to consider whether the data represents what the model will be applied to for that specific business problem.

An anecdote: We had a prospect that came to us with reams of data. They’d marked up tens of thousands of news articles for an elaborate taxonomy, with hundreds of different nodes. (Again, start small.)

Since the prospect did all the prep work, they were looking for a partner to train models. So we took the usual next steps -- splitting the data into training, test, and validation sets and building some models. And it was working spectacularly well (suspiciously so, in retrospect), with F-scores well into the 90s.

We delivered the models, and the prospect came back and said they had generated a new test set, but our models were only scoring around 10 percent. We were baffled.

When we had them share their test data with us, we realized that they had a data lake they pulled from, and every news article in the original training set was from one single day in 2012: tens of thousands of articles from a single day. It just so happened that on that day, there was a natural disaster in Spain, so anytime “Spain” ran through the model, it associated the country with natural disasters. Even though the data volume was huge, the fact that it all came from a single day substantially skewed the trained model.
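A simple check can surface this kind of skew before any training starts. The sketch below, which assumes each document carries a publication date, reports the most common date and the share of the corpus it accounts for. The function name date_concentration, the sample dates, and the 50 percent threshold are all hypothetical choices for illustration.

```python
from collections import Counter
from datetime import date


def date_concentration(dates):
    """Return (most_common_date, share_of_documents) for a list of dates."""
    counts = Counter(dates)
    top_date, top_count = counts.most_common(1)[0]
    return top_date, top_count / len(dates)


# Hypothetical publication dates, one per document.
publication_dates = [date(2012, 5, 1)] * 9 + [date(2013, 2, 10)]

top_date, share = date_concentration(publication_dates)
if share > 0.5:  # arbitrary threshold; tune it for your corpus
    print(f"Warning: {share:.0%} of documents are dated {top_date}")
```

The same idea applies to any metadata field that might be unintentionally narrow, such as source, author, or region.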


Tip #3:


Track and record everything. It’s straightforward to record more information when you’re already tracking data. The extra information can live in a separate database and doesn’t even have to be tied directly to the project you’re working on, but as time goes by, whatever you didn’t record is lost forever, so track it anyway.

One example here is with timestamps. If you’re seeing consistently bad annotations, knowing the timestamp can help you work out who tagged which document and remediate issues when they arise.

Similarly, with timestamps, an analysis may indicate that there are times of day or days of the week where annotators are less reliable. For example, with data marked up after lunch from 1:00-3:00 p.m., you may look at that data with a more skeptical eye.
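The time-of-day analysis described above can be sketched in a few lines, assuming each annotation record stores an ISO-format timestamp and the result of a later quality audit. The field names (annotator, timestamp, passed_audit) and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def error_rate_by_hour(annotations):
    """Map hour of day to the share of annotations that failed an audit."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in annotations:
        hour = datetime.fromisoformat(record["timestamp"]).hour
        totals[hour] += 1
        if not record["passed_audit"]:
            errors[hour] += 1
    return {hour: errors[hour] / totals[hour] for hour in sorted(totals)}


# Hypothetical annotation log; real logs would hold far more records.
annotations = [
    {"annotator": "a1", "timestamp": "2021-03-01T10:15:00", "passed_audit": True},
    {"annotator": "a1", "timestamp": "2021-03-01T10:40:00", "passed_audit": True},
    {"annotator": "a2", "timestamp": "2021-03-01T13:05:00", "passed_audit": False},
    {"annotator": "a2", "timestamp": "2021-03-01T13:30:00", "passed_audit": True},
]

rates = error_rate_by_hour(annotations)
for hour, rate in rates.items():
    print(f"{hour:02d}:00  error rate {rate:.0%}")
```

None of this is possible unless the timestamp was recorded when the annotation was made, which is the argument for capturing it even before you know how you will use it.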


Tip #4:


Set aside resources for the future. Getting labeled data is the gold standard, and there are many ways to get that data, whether bootstrapping, co-opting data, or purchasing it. But the platinum standard is up-to-date, labeled data.

If we’ve learned anything in the past year, it’s that the world changes. This is especially reflected in language and text. Text analytics and NLP are such complex problems precisely because language is ever-changing and evolving.

One example of this is smartphones. In the early aughts, smartphone features were talked about in very different terms than they are today. Whereas pixel density and the presence or absence of a media player may have been top concerns of buyers back then, the same features today would barely be a consideration. Similarly, even five years ago, a machine may not have understood that “they” could be used as a third-person singular pronoun to refer to a person without specifying gender. In contrast, today, the use of non-binary pronouns abounds.
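One lightweight way to watch for this kind of shift is to track how much of the incoming text uses words the model never saw in training. The sketch below computes a simple out-of-vocabulary rate. It assumes whitespace tokenization, and the function name and sample sentences are hypothetical. It is only a crude proxy: it would catch new product terms, but not an existing word such as “they” taking on a new usage.

```python
def out_of_vocabulary_rate(training_texts, new_texts):
    """Share of tokens in new_texts that never appear in training_texts."""
    vocab = {tok for text in training_texts for tok in text.lower().split()}
    new_tokens = [tok for text in new_texts for tok in text.lower().split()]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocab)
    return unseen / len(new_tokens)


# Hypothetical examples of older training text and newer incoming text.
training_texts = ["the phone has a media player", "great pixel density"]
new_texts = ["the phone has great battery life"]

oov_rate = out_of_vocabulary_rate(training_texts, new_texts)
print(f"out-of-vocabulary rate: {oov_rate:.0%}")
```

A rate that climbs over successive months is one hint that it may be time to spend some of the resources set aside for fresh labeled data.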

Rather than assuming that your ML product will stay fixed, expect linguistic changes that you can’t foresee but will want to respond to, and be sure to set aside resources to do that.