Named Entity Recognition and Classification with Scikit-Learn
Named Entity Recognition and Classification is the process of recognizing information units, such as names (of persons, organizations, and locations) and numeric expressions, in unstructured text. The goal is to develop practical, domain-independent techniques that automatically detect named entities with high accuracy.
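As a concrete illustration, NER output is usually expressed as BIO tags (B- begins an entity, I- continues it, O is outside), which can then be grouped into entity spans. A minimal sketch on a toy sentence with hypothetical tags:

```python
# Toy sentence with BIO tags (hypothetical labels, in the style of this data set).
tokens = ["Thousands", "of", "demonstrators", "have", "marched", "through", "London"]
tags   = ["O", "O", "O", "O", "O", "O", "B-geo"]

def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) spans."""
    entities, current, etype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(extract_entities(tokens, tags))  # [('London', 'geo')]
```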
Conditional Random Fields (CRFs)
CRFs are often used for labeling or parsing sequential data, such as natural language text, and find applications in POS tagging and named entity recognition, among other tasks.
sklearn-crfsuite
We will train a CRF model for named entity recognition using sklearn-crfsuite on our data set.
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from collections import Counter
The following code retrieves each sentence together with its POS and entity tags. Thanks to Tobias for the tip.
class SentenceGetter(object):

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                           s['POS'].values.tolist(),
                                                           s['Tag'].values.tolist())]
        self.grouped = self.data.groupby('Sentence #').apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

getter = SentenceGetter(df)
sentences = getter.sentences
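Conceptually, the class above just collects each sentence's (word, POS, tag) triples. A pure-Python sketch of the same grouping, on a few toy rows with hypothetical values:

```python
from itertools import groupby

# Toy rows in the shape of the data set: (sentence_id, word, pos, tag).
rows = [
    ("Sentence: 1", "Thousands", "NNS", "O"),
    ("Sentence: 1", "London", "NNP", "B-geo"),
    ("Sentence: 2", "Families", "NNS", "O"),
]

# Group consecutive rows by sentence id, keeping (word, pos, tag) triples,
# mirroring what SentenceGetter's agg_func does with pandas groupby.
sentences = [
    [(w, p, t) for _, w, p, t in group]
    for _, group in groupby(rows, key=lambda r: r[0])
]

print(sentences[0])  # [('Thousands', 'NNS', 'O'), ('London', 'NNP', 'B-geo')]
```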
Feature Extraction
Next, we extract more features (word parts, simplified POS tags, lower/title/upper flags, features of nearby words) and convert them to the sklearn-crfsuite format: each sentence should be converted to a list of dicts. The following code was taken from the official sklearn-crfsuite site.
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
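To see the windowing behavior in isolation, here is a trimmed-down sketch of the same idea, keeping only a few of the features above, applied to a toy two-token sentence:

```python
def token_features(sent, i):
    """Minimal version of word2features: current word plus a one-token window."""
    word = sent[i][0]
    features = {"word.lower()": word.lower(), "word.istitle()": word.istitle()}
    if i > 0:
        features["-1:word.lower()"] = sent[i - 1][0].lower()
    else:
        features["BOS"] = True  # first token: mark beginning of sentence
    if i < len(sent) - 1:
        features["+1:word.lower()"] = sent[i + 1][0].lower()
    else:
        features["EOS"] = True  # last token: mark end of sentence
    return features

sent = [("London", "NNP", "B-geo"), ("calling", "VBG", "O")]
print(token_features(sent, 0))
# {'word.lower()': 'london', 'word.istitle()': True, 'BOS': True, '+1:word.lower()': 'calling'}
```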
Split train and test sets
from sklearn.model_selection import train_test_split

X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
Train a CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
Figure 14
Evaluation
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels=new_classes))
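The "flat" in flat_classification_report means the per-sentence label lists are flattened into one long token sequence before scoring. A minimal sketch of that flattening with a token-level accuracy, on toy labels (not our actual results):

```python
from itertools import chain

# Toy per-sentence gold and predicted label lists (hypothetical values).
y_test = [["O", "B-geo"], ["B-per", "I-per", "O"]]
y_pred = [["O", "B-geo"], ["B-per", "O", "O"]]

# Flatten sentence-level lists into single token streams, as the
# "flat" metrics in sklearn_crfsuite.metrics do before scoring.
flat_true = list(chain.from_iterable(y_test))
flat_pred = list(chain.from_iterable(y_pred))

accuracy = sum(t == p for t, p in zip(flat_true, flat_pred)) / len(flat_true)
print(accuracy)  # 0.8
```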
Figure 15
Way better! We will stick with sklearn-crfsuite and explore it further!
What did our classifier learn?
def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])
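Under the hood, crf.transition_features_ is a dict mapping (label_from, label_to) pairs to learned weights, and Counter.most_common just sorts it by weight. A toy illustration with made-up weights:

```python
from collections import Counter

# Hypothetical transition weights, in the same shape as crf.transition_features_.
transition_features = {
    ("B-geo", "I-geo"): 6.7,
    ("B-per", "I-per"): 5.9,
    ("O", "I-geo"): -5.2,
}

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

# most_common sorts by weight, so likely transitions come first
# and heavily penalized ones come last.
print_transitions(Counter(transition_features).most_common(2))
```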
Figure 16
Interpretation: It is very likely that the beginning of a geographical entity (B-geo) will be followed by a token inside geographical entity (I-geo), but transitions to inside of an organization name (I-org) from tokens with other labels are penalized hugely.
Check the state features
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])
Figure 17
Observations:

5.183603 B-tim word[-3:]:day
The model learns that if a nearby word was "day", then the token is likely part of a time indicator.

3.370614 B-per word.lower():president
The model learns that the token "president" is likely to be at the beginning of a person name.

-3.521244 O postag:NNP
The model learns that proper nouns are often entities.

-3.087828 O word.isdigit()
Digits are likely entities.

-3.233526 O word.istitle()
Title-cased words are likely entities.
ELI5
ELI5 is a Python package that allows you to inspect the weights of sklearn_crfsuite.CRF models.
Inspect model weights
import eli5

eli5.show_weights(crf, top=10)
Figure 18
Observations:
- It does make sense that I-entity must follow B-entity, such as I-geo follows B-geo, I-org follows B-org, I-per follows B-per, and so on.
- We can also see that it is not common in this data set to have a person right after an organization name (B-org -> I-per has a large negative weight).
- The model learned large negative weights for impossible transitions like O -> I-geo, O -> I-org and O -> I-tim, and so on.
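Those impossible transitions can be stated as a simple rule: an I-x tag may only continue a B-x or I-x tag of the same type. A small checker under that assumption, using hypothetical tag sequences:

```python
def invalid_transitions(tags):
    """Return (position, previous_tag, tag) for every I- tag that does not
    continue a same-type B-/I- tag, e.g. O -> I-geo or B-org -> I-per."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        # tag[2:] strips the "B-"/"I-" prefix to get the entity type;
        # for "O" the slice is empty, so it never matches an I- type.
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            bad.append((i, prev, tag))
        prev = tag
    return bad

print(invalid_transitions(["O", "I-geo", "B-org", "I-org"]))  # [(1, 'O', 'I-geo')]
```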
For easier reading, we can check only a subset of tags.
eli5.show_weights(crf, top=10, targets=['O', 'B-org', 'I-per'])
Figure 19
Or check only some of the features for all tags.
eli5.show_weights(crf, top=10, feature_re='^word\.is', horizontal_layout=False, show=['targets'])
Figure 20
That was it, for now. I enjoyed getting my hands dirty with sklearn-crfsuite and ELI5, and I hope you did too. Source code can be found on Github. Have a great week!
Bio: Susan Li is changing the world, one article at a time. She is a Sr. Data Scientist, located in Toronto, Canada.
Original. Reposted with permission.
Related:
- Multi-Class Text Classification with Scikit-Learn
- Machine Learning for Text Classification Using SpaCy in Python
- An End-to-End Project on Time Series Analysis and Forecasting with Python