Natural Language Processing Recipes: Best Practices and Examples
Here is an overview of another great natural language processing resource, this time from Microsoft, which demonstrates best practices and implementation guidelines for a variety of tasks and scenarios.
We at KDnuggets have been doing our best to highlight some quality natural language processing (NLP) resources in the recent past, most notably The Big Bad NLP Database and The Super Duper NLP Repo, a pair of initiatives managed by Quantum Stat. The first of these is a curated repository of NLP datasets neatly organized around tasks, while the second is a collection of Google Colab notebooks demonstrating implementations of many of these tasks.
In this vein, we have found that the Natural Language Processing Best Practices & Examples repository, by Microsoft, is another worthy addition to this collection. The repository describes its purpose as follows:
This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.
The notebooks and utility functions it describes are less end-to-end solutions to NLP tasks than they are guidance for ensuring you implement your systems with best practices in mind.
The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. [...] We hope that the tools can significantly reduce the “time to market” by simplifying the experience from defining the business problem to development of solution by orders of magnitude. In addition, the example notebooks would serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.
Emphasizing the multi-language principles of Emily Bender, the repo also points out that NLP "is not synonymous with English," and states that the goal of the project "is to provide end-to-end examples in as many languages as possible," encouraging community contributions to help make that happen.
The repo contains a series of Jupyter notebooks, grouped into collections that implement the following NLP scenarios, each of which is further described in the table:
And the guides are not one-dimensional; take, for example, the text classification notebooks. Several notebooks cover this scenario, each using a different combination of dataset, natural language, environment (local or Azure cloud-based), language model, and task focus.
The notebooks also lean on scripts in the utils_nlp module to help alleviate some of the "tedious tasks ranging from data loading, dataset understanding, model development, model evaluation to productionize a trained NLP model." Be sure to check out these utilities, developed by Microsoft Research, which are intended to save time and speed up some of the more laborious tasks associated with natural language processing.
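To give a feel for the kind of chore such utilities take off your plate, here is a minimal sketch of two helpers, one for dataset understanding and one for model evaluation. To be clear, this is not code from the repository: the function names and the toy labels are hypothetical, and the sketch uses only the Python standard library rather than the utils_nlp API.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


# Hypothetical helpers for illustration only; they are NOT part of utils_nlp.
def label_distribution(labels: Sequence[str]) -> Dict[str, float]:
    """Dataset understanding: return the fraction of examples per label."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {label: count / total for label, count in counts.items()}


def accuracy_and_recall(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[float, Dict[str, float]]:
    """Model evaluation: return overall accuracy and recall per gold label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    # Count the correctly predicted examples for each gold label.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    totals = Counter(y_true)
    accuracy = sum(hits.values()) / len(y_true)
    recall = {label: hits[label] / totals[label] for label in totals}
    return accuracy, recall


if __name__ == "__main__":
    # Toy gold labels and predictions, made up for this example.
    gold = ["entailment", "neutral", "contradiction", "neutral"]
    predicted = ["entailment", "neutral", "neutral", "neutral"]
    print(label_distribution(gold))
    print(accuracy_and_recall(gold, predicted))
```

Small functions like these tend to get rewritten in every project, which is the argument for keeping them in a shared, tested module instead of pasting them into each notebook.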
I'm a sucker for almost anything NLP, from learning resources, to example notebooks, to frameworks and libraries, to language models, to dataset collections, and beyond. If you are too, I suggest you check out this best-practice-oriented repo from Microsoft.