Introduction to Named Entity Recognition

Named Entity Recognition (NER) is a technique that comes in handy in many Natural Language Processing tasks. Read on to find out how.




NLTK (Natural Language Toolkit) is a Python package that provides a set of natural language corpora and APIs for a wide variety of NLP algorithms.

Performing Named Entity Recognition with NLTK takes three stages:

  1. Word Tokenization
  2. Parts of Speech (POS) tagging
  3. Named Entity Recognition

To install NLTK, run:

pip install nltk

Now, let’s perform the first two stages:

import nltk
print('NLTK version: %s' % (nltk.__version__))

from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')

Note that parts-of-speech tagging and named entity recognition rely on some standard corpora and pre-trained models that are distributed separately from the NLTK package, which is why the code above downloads them.

article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

def fn_preprocess(art):
    art = nltk.word_tokenize(art)
    art = nltk.pos_tag(art)
    return art

art_processed = fn_preprocess(article)

Snapshot of Output (POS tagging) from the above code

To understand what each tag means, refer to the list below:

CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to, as in go ‘to’ the store
UH interjection
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
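
As a rough illustration of how these codes are used in practice, the sketch below filters a list of (token, tag) pairs, in the same shape that nltk.pos_tag returns, down to its nouns. The sample data is hand-written and the helper name select_by_tag_prefix is made up for this example; neither comes from the article's actual output.

```python
from collections import Counter

# Hand-written sample in the (token, tag) shape returned by nltk.pos_tag.
# The tags are assigned by hand for illustration, not produced by NLTK.
sample_tagged = [
    ('Japan', 'NNP'), ('shares', 'NNS'), ('dropped', 'VBD'),
    ('1.7', 'CD'), ('percent', 'NN'), ('on', 'IN'), ('Tuesday', 'NNP'),
]

def select_by_tag_prefix(tagged, prefix):
    """Return the tokens whose POS tag starts with the given prefix."""
    return [token for token, tag in tagged if tag.startswith(prefix)]

# All noun tags (NN, NNS, NNP, NNPS) share the 'NN' prefix.
nouns = select_by_tag_prefix(sample_tagged, 'NN')

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in sample_tagged)

print(nouns)
print(tag_counts)
```

Matching on a prefix rather than an exact tag is a convenience: it picks up singular, plural and proper nouns in one pass, at the cost of lumping them together.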

Once parts-of-speech tagging is done, the next step is chunking. Text chunking, also called shallow parsing, typically follows POS tagging and adds more structure to the sentence by grouping words into “chunks”.

So, let’s perform chunking on the article, which we have already POS tagged.

Our target here is to NER tag only the nouns.

results = ne_chunk(art_processed)

# Print only the lines of the tree that contain a noun tag (NN, NNS, NNP, NNPS)
for x in str(results).split('\n'):
    if '/NN' in x:
        print(x)

A snapshot of the output follows:

Snapshot of the output from the above code
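
Splitting the printed tree on newlines works for a quick look, but the result of ne_chunk is an nltk.Tree, so it can also be walked directly. The sketch below uses a small hand-built tree as a stand-in for real ne_chunk output, which keeps it runnable without downloading any models. The sample tokens, the GPE and PERSON labels on them, and the helper name extract_entities are all assumptions made for illustration.

```python
from nltk import Tree

# Hand-built stand-in for ne_chunk output: plain (token, tag) tuples at the
# top level, with named entities wrapped in labelled subtrees.
sample_tree = Tree('S', [
    Tree('GPE', [('Japan', 'NNP')]),
    ('shares', 'NNS'),
    ('fell', 'VBD'),
    (',', ','),
    Tree('PERSON', [('Theresa', 'NNP'), ('May', 'NNP')]),
    ('said', 'VBD'),
])

def extract_entities(tree):
    """Return (entity_text, label) pairs for each top-level labelled subtree."""
    entities = []
    for node in tree:
        # Tokens outside any entity are plain tuples; entities are Trees.
        if isinstance(node, Tree):
            text = ' '.join(token for token, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

entities = extract_entities(sample_tree)
print(entities)
```

Working on the tree keeps multi-word entities together, which the line-by-line string filter cannot guarantee once the printed tree wraps.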

The output looks decent but not great. Let’s take up a slightly more complex task: implementing noun phrase chunking to identify named entities.

Our chunk pattern consists of one rule: a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.

pattern = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(pattern)
cs = cp.parse(art_processed)

The output of the above chunking is below:

Snapshot of the output from the above code
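
To see the rule in isolation, here is a small sketch that applies the same pattern to a hand-tagged sentence, so it runs without any downloaded models. The sentence and its tags are invented for illustration and are not taken from the article's output.

```python
import nltk

# Hand-tagged sentence (tags assigned manually for illustration).
sentence = [
    ('the', 'DT'), ('sharp', 'JJ'), ('drop', 'NN'),
    ('pushed', 'VBD'), ('the', 'DT'), ('dollar', 'NN'),
    ('higher', 'RBR'),
]

# One rule: optional determiner, any number of adjectives, then a noun.
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN>}')
chunked = chunker.parse(sentence)

# Collect the text of each NP subtree found by the chunker.
noun_phrases = [
    ' '.join(token for token, tag in subtree.leaves())
    for subtree in chunked.subtrees(filter=lambda t: t.label() == 'NP')
]
print(noun_phrases)
```

Note that <NN> in the pattern matches only the exact tag NN, so plural or proper nouns (NNS, NNP) would need a broader pattern such as <NN.*>.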

The output can be read as a tree whose first level, “S”, denotes the sentence. It can also be viewed in a more readable format called IOB tags (Inside, Outside, Beginning).

from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

The snapshot of the output from the above code

In the output, each token appears on its own line along with its part-of-speech tag and its IOB tag. Since each entry is a tuple, you can unpack the tags like this:

for word, pos, ner in iob_tagged:
    print(word, pos, ner)
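
If you want the chunks back as phrases rather than per-token tags, the B- and I- prefixes are enough to regroup them. This is a minimal sketch, assuming hand-written triples in the (word, pos, tag) shape that tree2conlltags returns; the helper name iob_to_chunks is hypothetical and not part of NLTK.

```python
def iob_to_chunks(iob_triples):
    """Group (word, pos, iob_tag) triples into (phrase, chunk_type) pairs."""
    chunks = []
    current_words, current_type = [], None
    for word, pos, tag in iob_triples:
        if tag.startswith('B-'):
            # A B- tag always opens a new chunk; close any open one first.
            if current_words:
                chunks.append((' '.join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith('I-') and current_words and tag[2:] == current_type:
            # An I- tag continues the chunk that is currently open.
            current_words.append(word)
        else:
            # 'O' (or a stray I- tag) ends the current chunk, if any.
            if current_words:
                chunks.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
    # Flush a chunk that runs to the end of the input.
    if current_words:
        chunks.append((' '.join(current_words), current_type))
    return chunks

# Hand-written sample triples (not real output from the article's code).
sample_iob = [
    ('the', 'DT', 'B-NP'), ('sharp', 'JJ', 'I-NP'), ('drop', 'NN', 'I-NP'),
    ('pushed', 'VBD', 'O'),
    ('the', 'DT', 'B-NP'), ('dollar', 'NN', 'I-NP'),
]
print(iob_to_chunks(sample_iob))
```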



What’s next?

So, we have just learnt what Named Entity Recognition tagging is and how to use it to solve generic problems using APIs.

The natural progression from here would be to accomplish three things:

  1. Build your own NER tagger and also explore languages other than English.
  2. Build more sophisticated NER models (say, using Deep Learning) and evaluate how much better they perform.
  3. Take a natural language task you encounter daily, figure out a problem you want to solve, and then use everything you have learnt about NER to solve it.

I will be working along these lines and will try to share what I learn in coming posts on NER. You can contribute as well; please let me know in the comment section how you would like to do that.

Happy learning :)


Bio: Suvro Banerjee is a Machine Learning Engineer @ Juniper Networks.

Original. Reposted with permission.