Lemma, Lemma, Red Pyjama: Or, doing words with AI

If we want a machine learning model to be able to generalize these forms together, we need to map them to a shared representation. But when are two different words the same for our purposes? It depends.



By Paul Barba, Chief Scientist of Lexalytics

Language is easy for humans, but much harder for AIs. Humans are all about inference, extrapolation and pattern recognition. As we learn language, we internalize the rules around how words get inflected for things like plurality or tense. We know that “walked” is “walk” plus the “-ed” past tense ending, and that the “-s” in “dogs” typically means we’re dealing with more than one dog. We understand that these forms aren’t entirely separate words, but rather variations of a single “lemma”, which is the term given to the dictionary form of a given word or concept.

Computers, on the other hand, need much more guidance. Which is where the natural language processing (NLP) experts come in.

When is a dog not a dog? Ask a computer.

Unlike humans, computers are very literal minded. They have to be explicitly taught how to deal with words. Unless they’re told otherwise, they’ll treat “dog” and “dogs” as two completely unrelated words. The same with “dog” and “Dog”. If we want a machine learning model to be able to generalize these forms together, we need to map them to a shared representation. But when are two different words the same for our purposes? Does plurality matter? How about capitalization? Is it okay to group all of the forms of a word together without any consideration for tense, aspect or mood? It depends.

Capitalization by itself doesn’t usually change the meaning of a word, so a computer can usually be told to treat “dog” and “Dog” in the same way. However, when training a part-of-speech model (a model that understands whether something is a verb, noun or adjective), capitalization helps indicate the difference between the verb in “to mark a paper” and the proper noun “Mark Jones”. Capitalization is also a strong clue when developing entity recognition, and is usually combined with part-of-speech information that tells the computer whether this term is a proper noun.
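One way to have it both ways is to fold case for the shared representation while keeping the original capitalization as a separate feature. The snippet below is a minimal sketch of that idea; the function name normalize_token is hypothetical and not taken from any particular library.

```python
from typing import List, Tuple


def normalize_token(token: str) -> Tuple[str, bool]:
    """Return (case-folded form, was_capitalized) for a single token.

    The folded form lets "dog" and "Dog" share a representation, while the
    boolean keeps the capitalization clue for part-of-speech or entity models.
    """
    return token.casefold(), token[:1].isupper()


tokens: List[str] = ["Dog", "dog", "Mark", "mark"]
features = [normalize_token(t) for t in tokens]

for original, (folded, capitalized) in zip(tokens, features):
    print(original, folded, capitalized)
```

Here “Mark” and “mark” end up with the same folded form, but a downstream model can still see that one of them was capitalized.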

We can take a similar approach to plurality, training a model or writing a rule that tells the computer that “dog-” is our lemma (or sometimes “stem”), and that “-s” is a plural marker. This can be written as a blanket, language-wide rule like our capitalization rule, but like our Mark/mark example, we’d need to write some additional sub-rules to account for exceptions like “oxen” or “ponies” or “sheep”. If plurality doesn’t matter to us, we can just group “dog” and “dogs” together with no concern for the plural.
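As a rough illustration of a blanket rule plus sub-rules, the sketch below checks an exception table first, then a spelling rule for “-ies”, then the general “-s” rule. The names IRREGULAR_PLURALS and singularize are made up for this example, and the rules are deliberately incomplete; real English needs many more cases.

```python
from typing import Dict

# Hypothetical exception table: plurals the suffix rules below would get wrong.
IRREGULAR_PLURALS: Dict[str, str] = {
    "oxen": "ox",
    "sheep": "sheep",
}


def singularize(word: str) -> str:
    """Map a plural noun to its singular form using a few toy rules."""
    w = word.lower()
    # Sub-rule 1: irregular forms are looked up directly.
    if w in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[w]
    # Sub-rule 2: "ponies" -> "pony" (the length check leaves "ties" alone).
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    # Blanket rule: strip a final "-s", but not from words like "glass".
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


for plural in ["dogs", "ponies", "oxen", "sheep"]:
    print(plural, "->", singularize(plural))
```

The ordering matters: exceptions are checked before the general rule, so the blanket rule only ever sees words nobody has flagged as special.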

 

Lemmatization: the fine art of word surgery

What we don’t want to do, however, is just search for and return all words with “dog” in them. Otherwise words like dogged, hotdog, boondoggle and doge (the Italian leader, not the meme) will lead our AI astray.
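The difference between substring matching and whole-word matching is easy to see with a regular expression. This is only a sketch: the sample sentence is invented, and the whole-word pattern handles just “dog” and its regular plural.

```python
import re

text = "The dog was dogged; a hotdog, a boondoggle, the doge, and two dogs."

# Substring search: any word that merely contains "dog".
substring_hits = re.findall(r"\w*dog\w*", text, flags=re.IGNORECASE)

# Whole-word search: "dog" or "dogs" bounded by non-word characters.
word_hits = re.findall(r"\bdogs?\b", text, flags=re.IGNORECASE)

print(substring_hits)
print(word_hits)
```

The substring pattern picks up all six “dog”-containing words, while the word-boundary pattern returns only “dog” and “dogs”.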

Nor do we want to just lop off the inflected end of a word and hope for the best. This approach is known as “stemming.” Some of the common stemming algorithms include Porter, Snowball and Krovetz, and while they have their place in language analysis, they can also introduce errors into your data sets. Stemming works just fine with “dogs”, which has a regular plural. But take the wordform “ponies”. A naive stemming approach would simply remove the plural “-s”, giving us “*ponie” as the base form. It’s even more disastrous in languages like Hebrew or Arabic, where internal vowels provide critical meaning.
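To make the failure concrete, here is a deliberately crude suffix stripper. It is not Porter, Snowball or Krovetz, which apply more rules than this; naive_stem is a hypothetical name used only to illustrate the “lop off the end” idea.

```python
def naive_stem(word: str) -> str:
    """Strip a trailing "-s" with no regard for spelling rules."""
    return word[:-1] if word.endswith("s") else word


for form in ["dogs", "ponies"]:
    print(form, "->", naive_stem(form))
```

“dogs” comes out as “dog”, but “ponies” comes out as “ponie”, which is not a word anyone would look up in a dictionary.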

To make sure our model returns the right form, we’d want to use an approach called lemmatization. Lemmatization pares a word back to its underlying form while taking into account things like grammaticality, morphology and phonology. Lemmatization would remove the “-ies” plural and provide the correct lemma, “pony”. Lemmatization is to stemming what surgery is to butchery. Sometimes all you need is a cleaver, but sometimes a scalpel is the right tool.

There are many ways of helping a machine handle lemmas. You can train a model to learn from examples, write custom rules, or derive likely rules by just looking at word variations in a large corpus. Should “accounting” and “accountant” be distinct words? For certain domains and certain problems you’ll generalize faster and more accurately if you can conflate the two. In other cases you’ll be losing information. It all depends.
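As a toy version of deriving rules from word variations, the sketch below counts how often adding a candidate suffix to one vocabulary word yields another vocabulary word. The vocabulary, the suffix list and the function name suffix_rule_counts are all assumptions made for illustration; a real corpus would be far larger and noisier.

```python
from collections import Counter
from typing import Iterable, Tuple


def suffix_rule_counts(
    vocabulary: Iterable[str],
    suffixes: Tuple[str, ...] = ("s", "ed", "ing"),
) -> Counter:
    """Count, per suffix, how many words also appear with that suffix added."""
    vocab = set(vocabulary)
    counts: Counter = Counter()
    for word in vocab:
        for suffix in suffixes:
            if word + suffix in vocab:
                counts[suffix] += 1
    return counts


toy_vocabulary = [
    "dog", "dogs", "walk", "walks", "walked", "walking", "pony", "ponies",
]
rule_counts = suffix_rule_counts(toy_vocabulary)
print(rule_counts)
```

Suffixes with high counts are candidates for general rules. Note that “pony”/“ponies” contributes nothing here, which is exactly the kind of pair that would need its own rule or an exception entry.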

A horse, a horse, my kingdom for an equine: expanding your AI’s vocabulary

So far our ML model has been mostly focused on correctly handling the word “dog”. But it’s possible that there’s plenty of great data out there that relates to dogs without explicitly mentioning the word “dog”. To capture that data, you might want to expand your model to consider words like “canine”, “German shepherd”, “man’s best friend” or even “Fido”. There are a couple of ways you can do this. One is to use a specially built thesaurus populated with synonyms for the term in question.
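A thesaurus-based expansion can be sketched as a lookup table from a concept to the terms that signal it. The table contents and the names DOG_THESAURUS and mentions_concept are hypothetical, and the matching is a simple whole-word, case-insensitive search.

```python
import re
from typing import Dict, Set

# Hypothetical hand-built thesaurus entry for the concept "dog".
DOG_THESAURUS: Dict[str, Set[str]] = {
    "dog": {
        "dog", "dogs", "canine", "german shepherd",
        "man's best friend", "fido",
    },
}


def mentions_concept(text: str, concept: str, thesaurus: Dict[str, Set[str]]) -> bool:
    """Return True if any term listed for the concept occurs as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", lowered)
        for term in thesaurus[concept]
    )


print(mentions_concept("Our German shepherd loves the park.", "dog", DOG_THESAURUS))
print(mentions_concept("The doge ruled Venice.", "dog", DOG_THESAURUS))
```

The first sentence counts as a dog mention even though it never uses the word “dog”, while the doge is correctly left out.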

Another is to use word embeddings. This is where words are assigned a point in an abstract space, with the semantic distance between words determining how closely they’re plotted to other like words. This approach lets your ML model map out an area of a language rather than making decisions on a word-by-word basis.
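The idea of semantic distance can be sketched with cosine similarity over word vectors. The three-dimensional vectors below are invented by hand purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions. The names TOY_EMBEDDINGS, cosine_similarity and nearest are hypothetical.

```python
import math
from typing import Dict, List

# Hand-made toy vectors; the values are assumptions, not learned embeddings.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "dog": [0.9, 0.1, 0.0],
    "canine": [0.85, 0.15, 0.05],
    "pony": [0.6, 0.5, 0.1],
    "accountant": [0.0, 0.2, 0.95],
}


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word: str) -> str:
    """Return the other vocabulary word whose vector is most similar."""
    others = [w for w in TOY_EMBEDDINGS if w != word]
    return max(
        others,
        key=lambda w: cosine_similarity(TOY_EMBEDDINGS[word], TOY_EMBEDDINGS[w]),
    )


print(nearest("dog"))
```

With these toy values, “canine” sits closest to “dog” and “accountant” sits far away, which is the behavior a trained embedding is meant to produce on its own.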

There are two approaches to word embeddings: traditional (e.g. Word2vec) and language model (e.g. BERT). The former represents a word like “bank” as a single fixed vector of numbers, such that related words like “river” have nearby vectors. This offers a stronger form of normalization across distinct-looking words, and can allow for much faster training, but with potentially more errors in word understanding. Language models deal with the “financial bank” vs “river bank” distinction by building an embedding for a word out of the sentence it appears in. This is one of the best ways of dealing with homonyms, but it is computationally expensive, and the lack of a single representation for a word can cause difficulty in tuning, understanding and using a system.

Words as works in progress: the data scientist’s choice

There are a lot of options for dealing with word identity, from creating broad thesaural classes that group every possible variant of an idea to providing a unique representation for every possible meaning of a word with lots of senses. Just as dropping tenses or plurality markers when speaking a foreign language can result in information loss, lemmatization can limit model accuracy. However, native speakers are easily able to generalize and reason about tense and plurality, making sense of even the most beginner language learner’s efforts. So while leaving every wordform in our AI model as unique might give us a highly accurate AI, we’re not actually helping it create appropriate representations of the world.

While new techniques such as language models are starting to reduce the impact of this problem, data scientists working with text should be prepared to think carefully about which language regularities are important for what they’re trying to solve, and to experiment to find the right representation for the task at hand.

 
Bio: Paul Barba is the Chief Scientist of Lexalytics, where he is focused on applying force multiplying technologies to solve artificial intelligence-related challenges and drive innovation in AI even further. Paul has years of experience developing, architecting, researching and generally thinking about AI/machine learning, text analytics and natural language processing (NLP) software. He has been working on growing system understanding while reducing human intervention and bringing text analytics to web scale. Paul has expertise in diverse areas of NLP and machine learning, from sentiment analysis and machine summarization to genetic programming and bootstrapping algorithms. Paul is continuing to bring cutting edge research to solve everyday business problems, while working on new “big ideas” to push the whole field forward. Paul earned a degree in Computer Science and Mathematics from UMass Amherst.
