The Big Bad NLP Database: Access Nearly 300 Datasets

Check out this database of nearly 300 freely accessible NLP datasets, curated from around the internet.

When looking to hone your natural language processing (NLP) skills, finding accessible and relevant datasets can be one of the biggest bottlenecks of the experience. Lots of time can be spent trying to locate existing datasets for the learning task at hand, or attempting to curate your own data instead. It would be great to have a centralized listing of available NLP datasets... wouldn't it?

That's where The Big Bad NLP Database (BBNLPDB), managed by Quantum Stat, comes in. If you are seeking datasets to work on your NLP skills, you should definitely check it out.



BBNLPDB provides access to nearly 300 well-organized, sortable, and searchable natural language processing datasets.

Here you can find datasets ready to go for common NLP tasks and needs, such as document classification, question answering, automated image captioning, dialog, clustering, intent classification, language modeling, machine translation, text corpora, and more.

One drawback is that most of the datasets are in English, though a few entries in Arabic, Chinese, German, Dutch, and various Indian languages do exist, as do a number of multilingual datasets.

Do you have a dataset that is not included in the listing? Let the maintainers know, and they might add it.

Before anyone says it: sure, conveniently accessible and well-thought-out datasets are not representative of the real world. But that's not a concern when you are working on fine-tuning your technical skills. Standard datasets are also great for benchmarking, and the regular sets for all of the various types of common tasks are available in the BBNLPDB as well.

Check out the BBNLPDB yourself if you are in the market for NLP datasets. At the very least, you might find a centralized location for accessing some of the more common and frequently used sets in the field.