Natural Language in Python using spaCy: An Introduction
This article provides a brief introduction to working with natural language (sometimes called “text analytics”) in Python using spaCy and related libraries.
By Paco Nathan
This article and paired Domino project provide a brief introduction to working with natural language (sometimes called “text analytics”) in Python using spaCy and related libraries. Data science teams in industry must work with lots of text, one of the top four categories of data used in machine learning. Usually it’s human-generated text, but not always.
Think about it: how does the “operating system” for business work? Typically, there are contracts (sales contracts, work agreements, partnerships), there are invoices, there are insurance policies, there are regulations and other laws, and so on. All of those are represented as text.
You may run across a few acronyms: natural language processing (NLP), natural language understanding (NLU), natural language generation (NLG)—which are roughly speaking “read text”, “understand meaning”, “write text” respectively. Increasingly these tasks overlap and it becomes difficult to categorize any given feature.
The spaCy framework—along with a wide and growing range of plug-ins and other integrations—provides features for a wide range of natural language tasks. It’s become one of the most widely used natural language libraries in Python for industry use cases, and has quite a large community—and with that, much support for commercialization of research advances as this area continues to evolve rapidly.
We have configured the default Compute Environment in Domino to include all of the packages, libraries, models, and data you’ll need for this tutorial. Check out the Domino project to run the code.
If you’re interested in how Domino’s Compute Environments work, check out the Support Page.
Now let’s load spaCy and run some code:
The nlp variable is now your gateway to all things spaCy, loaded with the en_core_web_sm small model for English. Next, let’s run a small “document” through the natural language parser:
First we created a doc from the text, which is a container for a document and all of its annotations. Then we iterated through the document to see what spaCy had parsed.
Good, but it’s a lot of info and a bit difficult to read. Let’s reformat the spaCy parse of that sentence as a pandas dataframe:
Much more readable! In this simple case, the entire document is merely one short sentence. For each word in that sentence spaCy has created a token, and we accessed fields in each token to show:
- raw text
- lemma – a root form of the word
- part of speech
- a flag for whether the word is a stopword—i.e., a common word that may be filtered out
Next let’s use the displaCy library to visualize the parse tree for that sentence:
Does that bring back memories of grade school? Frankly, for those of us coming from more of a computational linguistics background, that diagram sparks joy.
But let’s back up for a moment. How do you handle multiple sentences?
There are features for sentence boundary detection (SBD)—also known as sentence segmentation—based on the builtin/default sentencizer:
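The sketch below is one way to try this. It assumes the spaCy v3 API and uses a blank English pipeline with only the rule-based sentencizer added, so it does not depend on a statistical model. The three-sentence text is made up for illustration, apart from its last sentence:

```python
import spacy

# A blank English pipeline plus the rule-based sentencizer (spaCy v3 API).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Illustrative text (an assumption).
text = ("We were all out at the zoo one day. "
        "I fell into the gorilla exhibit. "
        "The gorillas just went wild.")
doc = nlp(text)

sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(">", sentence)
```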
When spaCy creates a document, it uses a principle of non-destructive tokenization, meaning that the tokens, sentences, etc., are simply indexes into a long array. In other words, they don’t carve the text stream into little pieces. So each sentence is a span with a start and an end index into the document array:
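To make that concrete, here is a small sketch (spaCy v3 API assumed, same illustrative text and rule-based sentencizer as a stand-in for the full model) that prints the token offsets of each sentence:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ("We were all out at the zoo one day. "
        "I fell into the gorilla exhibit. "
        "The gorillas just went wild.")
doc = nlp(text)

# start and end are token offsets into the doc, with end exclusive.
bounds = [(sent.start, sent.end) for sent in doc.sents]
for start, end in bounds:
    print(start, end)
```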
We can index into the document array to pull out the tokens for one sentence:
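For instance, slicing the doc with the offsets of the last sentence gives back a span over exactly those tokens. This is a sketch under the same assumptions as above (spaCy v3, illustrative text, rule-based sentencizer):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ("We were all out at the zoo one day. "
        "I fell into the gorilla exhibit. "
        "The gorillas just went wild.")
doc = nlp(text)

# Take the offsets of the last sentence and slice the doc with them.
last = list(doc.sents)[-1]
span = doc[last.start:last.end]

print(span.text)
print([token.text for token in span])
```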
Or simply index into a specific token, such as the verb went in the last sentence:
At this point we can parse a document, segment that document into sentences, then look at annotations about the tokens in each sentence. That’s a good start.
Now that we can parse texts, where do we get texts? One quick source is to leverage the interwebs. Of course when we download web pages we’ll get HTML, and then need to extract text from them. Beautiful Soup is a popular package for that.
First, a little housekeeping:
In the following function get_text() we’ll parse the HTML to find all of the <p> tags, then extract the text for those:
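Here is one possible shape for that function, sketched with Beautiful Soup. As an assumption to keep the example free of network calls, this version takes an HTML string rather than a URL; in practice you would first download the page (for example with the requests package) and pass in its body. The sample HTML is invented for illustration:

```python
from bs4 import BeautifulSoup


def get_text(html):
    """Return the text of every <p> element in an HTML string, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
    return "\n".join(paragraphs)


# Made-up page: only the two <p> elements should survive.
sample_html = """
<html><body>
  <h1>Sample License</h1>
  <p>Permission is granted to use this software.</p>
  <div>navigation menu</div>
  <p>The software is provided without warranty.</p>
</body></html>
"""

print(get_text(sample_html))
```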
Now let’s grab some text from online sources. We can compare open source licenses hosted on the Open Source Initiative site:
One common use case for natural language work is to compare texts. For example, with those open source licenses we can download their text, parse, then compare similarity metrics among them:
Admittedly, there was some extra text included in each document due to the OSI disclaimer in the footer—but this provides a reasonable approximation for comparing the licenses.
Natural Language Understanding
Now let’s dive into some of the spaCy features for NLU. Given that we have a parse of a document, from a purely grammatical standpoint we can pull the noun chunks, i.e., each of the noun phrases:
Not bad. The noun phrases in a sentence generally carry more of its information content, so extracting them works as a simple filter for reducing a long document into a more “distilled” representation.
We can take this approach further and identify named entities within the text, i.e., the proper nouns:
The displaCy library provides an excellent way to visualize named entities:
If you’re working with knowledge graph applications and other linked data, your challenge is to construct links between the named entities in a document and other related information for the entities, which is called entity linking. Identifying the named entities in a document is the first step in this particular kind of AI work. For example, given the text above, one might link the Steve Wozniak named entity to a lookup in DBpedia.
In more general terms, one can also link lemmas to resources that describe their meanings. For example, in an earlier section we parsed the sentence “The gorillas just went wild” and were able to show that the lemma for the word went is the verb go. At this point we can use a venerable project called WordNet, which provides a lexical database for English. In other words, it’s a computable thesaurus.
Then we’ll load the WordNet data via NLTK (these things happen):
Note that spaCy runs as a “pipeline” and allows means for customizing parts of the pipeline in use. That’s excellent for supporting really interesting workflow integrations in data science work. Here we’ll add the WordnetAnnotator from the spacy-wordnet project:
Within the English language, some words are infamous for having many possible meanings. For example, click through the results online in a WordNet search to find the meanings related to the word
Now let’s use spaCy to perform that lookup automatically:
Again, if you are working with knowledge graphs, those “word sense” links from WordNet could be used along with graph algorithms to help identify the meanings for a particular word. This can also be used to develop summaries for larger sections of text through a technique called summarization. It’s beyond the scope of this tutorial, but an interesting application currently for natural language in industry.
Going in the other direction, if you know a priori that a document was about a particular domain or set of topics, then you can constrain the meanings returned from WordNet. In the following example, we want to consider NLU results that are within Finance and Banking:
That example may look simple but, if you play with the domains list, you’ll find that the results have a kind of combinatorial explosion when run without reasonable constraints. Imagine having a knowledge graph with millions of elements: you’d want to constrain searches where possible to avoid having every query take days/weeks/months/years to compute.
Sometimes the problems encountered when trying to understand a text—or better yet when trying to understand a corpus (a dataset with many related texts)—become so complex that you need to visualize it first. Here’s an interactive visualization for understanding texts: scattertext, a product of the genius of Jason Kessler.
Let’s analyze text data from the party conventions during the 2012 US Presidential elections. Note: this cell may take a few minutes to run, but the results from all that number crunching are worth the wait.
Once you have the corpus ready, generate an interactive visualization in HTML:
Now we’ll render the HTML—give it a minute or two to load, it’s worth the wait:
Imagine if you had text from the past three years of customer support for a particular product in your organization. Suppose your team needed to understand how customers have been talking about the product? This scattertext library might come in quite handy! You could cluster (k=2) on NPS scores (a customer evaluation metric) then replace the Democrat/Republican dimension with the top two components from the clustering.
Five years ago, if you’d asked about open source in Python for natural language, a default answer from many people working in data science would’ve been NLTK. That project includes just about everything but the kitchen sink and has components which are relatively academic. Another popular natural language project is CoreNLP from Stanford. Also quite academic, albeit powerful, though CoreNLP can be challenging to integrate with other software for production use.
Then a few years ago everything in this natural language corner of the world began to change. The two principal authors for spaCy, Matthew Honnibal and Ines Montani, launched the project in 2015 and industry adoption was rapid. They focused on an opinionated approach (do what’s needed, do it well, no more, no less) which provided simple, rapid integration into data science workflows in Python, as well as faster execution and better accuracy than the alternatives. Based on these priorities, spaCy became sort of the opposite of NLTK. Since 2015, spaCy has consistently focused on being an open source project (i.e., depending on its community for directions, integrations, etc.) and being commercial-grade software (not academic research). That said, spaCy has been quick to incorporate the SOTA advances in machine learning, effectively becoming a conduit for moving research into industry.
It’s important to note that machine learning for natural language got a big boost during the mid-2000s as Google began to win international language translation competitions. Another big change occurred during 2017-2018 when, following the many successes of deep learning, those approaches began to out-perform previous machine learning models. For example, see the ELMo work on language embedding by Allen AI, followed by BERT from Google, and more recently ERNIE by Baidu. In other words, the search engine giants of the world have gifted the rest of us with a Sesame Street repertoire of open source embedded language models based on deep learning, which is now state of the art (SOTA). Speaking of which, to keep track of SOTA for natural language, keep an eye on NLP-Progress and Papers with Code.
The use cases for natural language have shifted dramatically over the past two years, after deep learning techniques came to the fore. Circa 2014, a natural language tutorial in Python might have shown word count or keyword search or sentiment detection, and the target use cases were relatively underwhelming. Circa 2019, we’re talking about analyzing thousands of documents for vendor contracts in an industrial supply chain optimization, or hundreds of millions of documents for policyholders of an insurance company, or gazillions of documents regarding financial disclosures. More contemporary natural language work tends to be in NLU, often to support construction of knowledge graphs, and increasingly in NLG where large numbers of similar documents can be summarized at human scale.
The spaCy Universe is a great place to check for deep-dives into particular use cases and to see how this field is evolving. Some selections from this “universe” include:
- Blackstone – parsing unstructured legal texts
- Kindred – extracting entities from biomedical texts (e.g., Pharma)
- mordecai – parsing geographic information
- Prodigy – human-in-the-loop annotation for labelling datasets
- spacy-raspberry – Raspberry Pi image for running spaCy and deep learning on edge devices
- Rasa NLU – Rasa integration for chat apps
Also, a couple super new items to mention:
- spacy-pytorch-transformers to fine tune (i.e., use transfer learning with) the Sesame Street characters and friends: BERT, GPT-2, XLNet, etc.
- spaCy IRL 2019 conference – check out videos from the talks!
There’s so much more that can be done with spaCy, and hopefully this tutorial provides an introduction. We wish you all the best in your natural language work.
Original. Reposted with permission.