Named Entity Recognition and Classification with Scikit-Learn

Named Entity Recognition and Classification is the process of recognizing information units such as person, organization and location names, as well as numeric expressions, in unstructured text. The goal is to develop practical, domain-independent techniques that detect named entities automatically and with high accuracy.

Conditional Random Fields (CRFs)

CRFs are often used for labeling or parsing sequential data such as natural language text. They find applications in POS tagging and named entity recognition, among others.


We will train a CRF model for named entity recognition using sklearn-crfsuite on our data set.

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from collections import Counter

The following code retrieves sentences together with their POS tags and entity tags. Thanks to Tobias for the tip.

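class SentenceGetter(object):
    """Group the rows of the data frame into sentences of (word, POS, tag) tuples."""

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # Assumes the data frame has 'Word', 'POS' and 'Tag' columns.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                           s['POS'].values.tolist(),
                                                           s['Tag'].values.tolist())]
        self.grouped = self.data.groupby('Sentence #').apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        # Return the next sentence, or None once the sentence ids run out.
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None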

getter = SentenceGetter(df)
sentences = getter.sentences
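
To make the grouping step concrete, here is a pandas-free sketch of the structure we end up with: one list of (word, POS, tag) tuples per sentence. The rows and the helper name group_rows are made up for illustration, and the sketch assumes the rows are already ordered by sentence.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows in the order (sentence id, word, POS, tag).
rows = [
    ("Sentence: 1", "London", "NNP", "B-geo"),
    ("Sentence: 1", "is", "VBZ", "O"),
    ("Sentence: 1", "big", "JJ", "O"),
    ("Sentence: 2", "Obama", "NNP", "B-per"),
    ("Sentence: 2", "spoke", "VBD", "O"),
]

def group_rows(rows):
    """Group consecutive rows that share a sentence id into one sentence."""
    return [
        [(word, pos, tag) for _, word, pos, tag in group]
        for _, group in groupby(rows, key=itemgetter(0))
    ]

toy_sentences = group_rows(rows)
print(toy_sentences[1])
```

Each element of toy_sentences has the same shape as an element of getter.sentences, which is what the feature functions in the next section expect.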

Feature Extraction

Next, we extract more features (word parts, simplified POS tags, lower/title/upper flags, features of nearby words) and convert them to the sklearn-crfsuite format: each sentence should be converted to a list of dicts. The following code was taken from the official sklearn-crfsuite site.

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    # Features of the current token.
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # Features of the previous token.
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        # First token: mark the beginning of the sentence.
        features['BOS'] = True
    if i < len(sent)-1:
        # Features of the next token.
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        # Last token: mark the end of the sentence.
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
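
To see what "a list of dicts" looks like in practice, here is a small self-contained sketch. The function toy_word2features is a cut-down stand-in for word2features (fewer features, same idea), and the sentence is made up for illustration.

```python
def toy_word2features(sent, i):
    """Cut-down stand-in for word2features: current token plus neighbours."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i > 0:
        features['-1:word.lower()'] = sent[i - 1][0].lower()
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sent) - 1:
        features['+1:word.lower()'] = sent[i + 1][0].lower()
    else:
        features['EOS'] = True  # end of sentence
    return features

# Hypothetical sentence in the (word, POS, tag) format used above.
sent = [("London", "NNP", "B-geo"), ("is", "VBZ", "O"), ("big", "JJ", "O")]

X_toy = [toy_word2features(sent, i) for i in range(len(sent))]
y_toy = [tag for _, _, tag in sent]

for feats, tag in zip(X_toy, y_toy):
    print(tag, feats)
```

Every token gets its own dict, and the labels form a parallel list of the same length. The first token carries BOS instead of previous-word features, and the last token carries EOS instead of next-word features.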

Split train and test sets

from sklearn.model_selection import train_test_split

X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

Train a CRF model

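# Hyperparameters are assumed (sklearn-crfsuite tutorial values).
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)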

Figure 14


y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels = new_classes))

Figure 15

Way better! We will stick to sklearn-crfsuite and explore more!
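
A note on what the report measures: the "flat" in flat_classification_report means the per-sentence tag lists are flattened into one long list before the usual per-label metrics are computed, so the scores are token-level rather than entity-level. The sketch below reproduces that idea in plain Python for a single label. The function name flat_prf and the toy tags are made up for illustration, and this is not the library's implementation.

```python
from itertools import chain

def flat_prf(y_true, y_pred, label):
    """Precision, recall and F1 for one label over flattened sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    pairs = list(zip(true_flat, pred_flat))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical gold and predicted tags for two short sentences.
y_true_toy = [["B-geo", "O", "O"], ["B-per", "I-per", "O"]]
y_pred_toy = [["B-geo", "O", "O"], ["B-per", "O", "O"]]

print(flat_prf(y_true_toy, y_pred_toy, "O"))
print(flat_prf(y_true_toy, y_pred_toy, "I-per"))
```

In the toy data the missed I-per token costs the I-per label all of its recall, while the O label only loses some precision. This is also why the frequent O tag is usually left out of the labels passed to the report.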

What did our classifier learn?

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

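print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))  # cutoff of 20 is assumed

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])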

Figure 16

Interpretation: It is very likely that the beginning of a geographical entity (B-geo) will be followed by a token inside a geographical entity (I-geo), but transitions into the inside of an organization name (I-org) from tokens with other labels are heavily penalized.

Check the state features

def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

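print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))  # cutoff of 30 is assumed

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])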

Figure 17


  1. 5.183603 B-tim word[-3:]:day - The model learns that a token ending in “day” is likely part of a time indicator.
  2. 3.370614 B-per word.lower():president - The model learns that the token "president" is likely to be at the beginning of a person name.
  3. -3.521244 O postag:NNP - The weight for O is strongly negative, so the model learns that proper nouns are often entities.
  4. -3.087828 O word.isdigit() - Digits are unlikely to be tagged O, so they are likely entities.
  5. -3.233526 O word.istitle() - TitleCased words are unlikely to be tagged O, so they are likely entities.
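
To show how such weights combine for a single token, here is a rough sketch that sums the weights listed above for a made-up token. It assumes that string features fire as name:value attributes and that True flags fire under their bare name, which matches how the attributes are printed above. All other learned weights, transition weights and numeric features such as bias are left out, so the scores are only partial. The helper names are hypothetical.

```python
# Weights copied from the list above, keyed by (attribute, label).
# Attribute names follow the feature names defined in word2features.
state_weights = {
    ("word[-3:]:day", "B-tim"): 5.183603,
    ("word.lower():president", "B-per"): 3.370614,
    ("postag:NNP", "O"): -3.521244,
    ("word.isdigit()", "O"): -3.087828,
    ("word.istitle()", "O"): -3.233526,
}

def active_attributes(features):
    """Turn a feature dict into the attribute names that fire for a token."""
    attrs = []
    for name, value in features.items():
        if isinstance(value, bool):
            if value:
                attrs.append(name)             # True flags keep their bare name
        elif isinstance(value, str):
            attrs.append(f"{name}:{value}")    # string features become name:value
    return attrs

def partial_state_score(features, label):
    """Sum the listed weights that fire for this token and label."""
    return sum(state_weights.get((attr, label), 0.0)
               for attr in active_attributes(features))

# Hypothetical token "Monday" tagged NNP.
monday = {
    "word[-3:]": "day",
    "word.istitle()": True,
    "word.isdigit()": False,
    "postag": "NNP",
}

print(round(partial_state_score(monday, "B-tim"), 6))
print(round(partial_state_score(monday, "O"), 6))
```

With only these five weights, B-tim already scores well above O for the toy token, which is the pattern the observations describe.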


ELI5 is a Python package that lets us inspect the weights of sklearn_crfsuite.CRF models.

Inspect model weights

import eli5
eli5.show_weights(crf, top=10)

Figure 18


  1. It does make sense that I-entity must follow B-entity, such as I-geo follows B-geo, I-org follows B-org, I-per follows B-per, and so on.
  2. We can also see that it is not common in this data set to have a person right after an organization name (B-org -> I-per has a large negative weight).
  3. The model learned large negative weights for impossible transitions like O -> I-geo, O -> I-org and O -> I-tim, and so on.
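
The rule behind these observations can be written down directly: in the BIO scheme, an I-X tag is only valid right after B-X or I-X. The sketch below checks a tag sequence against that rule. The function name invalid_transitions is made up for illustration, and the check is independent of the trained model.

```python
def invalid_transitions(tags):
    """Return (previous, current) tag pairs that break the BIO scheme."""
    bad = []
    prev = "O"  # treat the sentence start like an outside tag
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            if prev not in (f"B-{entity}", f"I-{entity}"):
                bad.append((prev, tag))
        prev = tag
    return bad

print(invalid_transitions(["B-geo", "I-geo", "O", "B-per", "I-per"]))
print(invalid_transitions(["O", "I-geo", "B-org", "I-per"]))
```

The first sequence is valid and yields an empty list. The second one is flagged for O -> I-geo and B-org -> I-per, the same kinds of transitions that receive large negative weights in the model.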

To make the output easier to read, we can check only a subset of tags.

eli5.show_weights(crf, top=10, targets=['O', 'B-org', 'I-per'])

Figure 19

Or check only some of the features for all tags.

eli5.show_weights(crf, top=10, feature_re=r'^word\.is',
                  horizontal_layout=False, show=['targets'])

Figure 20
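
To see which features a pattern like ^word\.is keeps, we can try it on the feature names with Python's re module. This sketch assumes the pattern is applied as a regular-expression search over feature names; it only illustrates the filter and does not call ELI5.

```python
import re

# A selection of feature names as produced by word2features.
feature_names = [
    "bias", "word.lower()", "word[-3:]", "word[-2:]",
    "word.isupper()", "word.istitle()", "word.isdigit()",
    "postag", "postag[:2]", "-1:word.istitle()", "BOS", "EOS",
]

pattern = re.compile(r"^word\.is")

selected = [name for name in feature_names if pattern.search(name)]
print(selected)
```

Only the three flags of the current word survive. The neighbouring-word flag -1:word.istitle() is dropped because the pattern is anchored at the start of the name.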

That was it, for now. I enjoyed getting my hands dirty with sklearn-crfsuite and ELI5, and I hope you did too. Source code can be found on GitHub. Have a great week!


Bio: Susan Li is changing the world, one article at a time. She is a Sr. Data Scientist, located in Toronto, Canada.

Original. Reposted with permission.