Named Entity Recognition and Classification with Scikit-Learn
Named Entity Recognition and Classification (NERC) is the process of recognizing information units such as person, organization and location names, as well as numeric expressions, in unstructured text. The goal is to develop practical, domain-independent techniques that detect named entities automatically and with high accuracy.
Conditional Random Fields (CRFs)
CRFs are often used for labeling or parsing sequential data such as natural language text. Their applications include POS tagging and named entity recognition, among others.
We will train a CRF model for named entity recognition using sklearn-crfsuite on our data set.
The following code retrieves the sentences together with their POS tags and entity tags. Thanks to Tobias for the tip.
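As a minimal sketch of this grouping step, assuming each row of the data set holds a sentence identifier, a word, its POS tag and its entity tag (the row layout and the helper name `group_sentences` are hypothetical, not taken from the original listing):

```python
def group_sentences(rows):
    """Group (sentence_id, word, pos, tag) rows into sentences.

    Returns a list of sentences; each sentence is a list of
    (word, pos, tag) tuples in their original order.
    """
    sentences = {}  # dicts keep insertion order in Python 3.7+
    for sentence_id, word, pos, tag in rows:
        sentences.setdefault(sentence_id, []).append((word, pos, tag))
    return list(sentences.values())


# Toy rows for illustration only (not taken from the article's data set).
rows = [
    (1, "London", "NNP", "B-geo"),
    (1, "is", "VBZ", "O"),
    (1, "big", "JJ", "O"),
    (2, "Monday", "NNP", "B-tim"),
    (2, "came", "VBD", "O"),
]
sentences = group_sentences(rows)
print(sentences[0])
```

Keeping each sentence as its own list matters because a CRF labels whole sequences, not isolated tokens.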
Next, we extract more features (word parts, simplified POS tags, lower/title/upper flags, features of nearby words) and convert them to sklearn-crfsuite format: each sentence should be converted to a list of dicts. The following code was taken from the official sklearn-crfsuite site.
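A sketch of such a feature function, modeled on the one in the sklearn-crfsuite tutorial, could look like the following. The exact feature set used in this article is not shown, so the keys below are an assumption, as is the (word, pos, tag) sentence format.

```python
def word2features(sent, i):
    """Build the feature dict for token i of a sentence of (word, pos, tag) tuples."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],          # word parts: last three characters
        "word[-2:]": word[-2:],          # word parts: last two characters
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "postag": postag,
        "postag[:2]": postag[:2],        # simplified POS tag
    }
    if i > 0:
        # Features of the previous word.
        prev_word, prev_postag = sent[i - 1][0], sent[i - 1][1]
        features.update({
            "-1:word.lower()": prev_word.lower(),
            "-1:word.istitle()": prev_word.istitle(),
            "-1:postag": prev_postag,
        })
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        # Features of the next word.
        next_word, next_postag = sent[i + 1][0], sent[i + 1][1]
        features.update({
            "+1:word.lower()": next_word.lower(),
            "+1:word.istitle()": next_word.istitle(),
            "+1:postag": next_postag,
        })
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sent):
    """Convert one sentence into a list of feature dicts, one per token."""
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    """Extract the entity tags of one sentence."""
    return [tag for _, _, tag in sent]


example = [("Monday", "NNP", "B-tim"), ("came", "VBD", "O")]
features = sent2features(example)
labels = sent2labels(example)
```

Each sentence becomes a list of dicts (one per token) and a parallel list of labels, which is the input shape sklearn-crfsuite expects.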
Split train and test sets
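scikit-learn's `train_test_split` is the usual tool for this step. The sketch below does the same thing with the standard library only, so it is easy to see what happens; the test fraction, the seed and the helper name `split_sentences` are assumptions, not values from the article.

```python
import random


def split_sentences(X, y, test_fraction=0.33, seed=0):
    """Shuffle sentence indices and split features and labels into train and test parts."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of sentences")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: ten one-token "sentences" whose label encodes the sentence index.
X = [[{"sentence": i}] for i in range(10)]
y = [[str(i)] for i in range(10)]
X_train, X_test, y_train, y_test = split_sentences(X, y)
```

The split is done per sentence, not per token, so no sentence is torn between the two sets.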
Train a CRF model
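A minimal training sketch with sklearn-crfsuite might look like this. The hyperparameter values are illustrative assumptions in the spirit of the library's tutorial, not settings confirmed by this article, and `train_crf` is a hypothetical helper name.

```python
# Illustrative hyperparameters (assumptions, not the article's confirmed settings).
CRF_PARAMS = {
    "algorithm": "lbfgs",              # gradient descent using the L-BFGS method
    "c1": 0.1,                         # L1 regularization coefficient
    "c2": 0.1,                         # L2 regularization coefficient
    "max_iterations": 100,
    "all_possible_transitions": True,  # also learn weights for unseen transitions
}


def train_crf(X_train, y_train, params=None):
    """Fit a sklearn-crfsuite CRF on per-sentence feature dicts and label lists."""
    # Third-party dependency, imported here so the sketch loads without it.
    import sklearn_crfsuite

    crf = sklearn_crfsuite.CRF(**dict(CRF_PARAMS if params is None else params))
    crf.fit(X_train, y_train)
    return crf
```

With the training data in place, `crf = train_crf(X_train, y_train)` followed by `crf.predict(X_test)` would return one list of predicted tags per test sentence.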
Way better! We will stick to sklearn-crfsuite and explore more!
What did our classifier learn?
Interpretation: It is very likely that the beginning of a geographical entity (B-geo) will be followed by a token inside a geographical entity (I-geo), but transitions to the inside of an organization name (I-org) from tokens with other labels are heavily penalized.
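To show how such a ranking can be produced, here is a small sketch. A fitted sklearn-crfsuite model exposes its transition weights as `crf.transition_features_`, a dict keyed by (label_from, label_to); the weights below are toy values for illustration only, and the helper names are hypothetical.

```python
from collections import Counter


def top_transitions(transition_features, n=2):
    """Return the n most likely and the n most heavily penalized label transitions."""
    ranked = Counter(transition_features).most_common()  # sorted by weight, descending
    return ranked[:n], ranked[-n:]


def format_transition(item):
    """Render one ((label_from, label_to), weight) pair as a readable line."""
    (label_from, label_to), weight = item
    return "%-6s -> %-7s %0.6f" % (label_from, label_to, weight)


# Toy weights, NOT the article's results.
toy_transitions = {
    ("B-geo", "I-geo"): 5.0,
    ("O", "O"): 3.0,
    ("B-org", "I-per"): -2.0,
    ("O", "I-org"): -7.0,
}
likely, unlikely = top_transitions(toy_transitions)
for item in likely + unlikely:
    print(format_transition(item))
```

Replacing `toy_transitions` with `crf.transition_features_` would list the transitions the trained model actually learned.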
Check the state features
5.183603 B-tim word[-3:]:day
The model learns that a token ending in “day” (a weekday name, for example) is likely to be the beginning of a time indicator.
3.370614 B-per word.lower():president
The model learns that the token "president" is likely to be at the beginning of a person name.
-3.521244 O postag:NNP
The large negative weight for the O tag means that proper nouns are rarely outside an entity; in other words, proper nouns are often entities.
-3.087828 O word.isdigit()
Tokens made of digits are unlikely to be tagged O, so digits are likely entities.
-3.233526 O word.istitle()
TitleCased words are unlikely to be tagged O, so they are likely entities.
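A sketch of how such a list can be extracted: a fitted sklearn-crfsuite model stores its state features in `crf.state_features_`, a dict keyed by (attribute, label). The sample dict below reuses four of the weights quoted above so the snippet runs without a trained model; `top_state_features` is a hypothetical helper name.

```python
def top_state_features(state_features, n=1):
    """Return the n strongest positive and the n strongest negative state features."""
    ranked = sorted(state_features.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n], ranked[-n:]


# Weights quoted in the text, keyed the way crf.state_features_ is keyed.
sample_state_features = {
    ("word.lower():president", "B-per"): 3.370614,
    ("postag:NNP", "O"): -3.521244,
    ("word.isdigit()", "O"): -3.087828,
    ("word.istitle()", "O"): -3.233526,
}
positive, negative = top_state_features(sample_state_features)
for (attr, label), weight in positive + negative:
    print("%0.6f %-8s %s" % (weight, label, attr))
```

A positive weight pushes the model toward the label when the feature is present; a negative weight pushes it away.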
ELI5 is a Python package that lets you inspect the weights of sklearn_crfsuite.CRF models.
Inspect model weights
- It does make sense that I-entity must follow B-entity, such as I-geo follows B-geo, I-org follows B-org, I-per follows B-per, and so on.
- We can also see that it is not common in this data set to have a person right after an organization name (B-org -> I-per has a large negative weight).
- The model learned large negative weights for impossible transitions such as O -> I-geo, O -> I-org and O -> I-tim.
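The observations above reflect the BIO tagging scheme, in which an I- tag may only follow a B- or I- tag of the same entity type. The CRF is never given this rule; it learns it from the data as transition weights. A small sketch of the rule itself (the function name is hypothetical):

```python
def is_valid_bio_transition(label_from, label_to):
    """Return True if label_to may directly follow label_from under the BIO scheme."""
    if not label_to.startswith("I-"):
        return True  # O and B-* tags can follow any tag
    entity = label_to[2:]
    # I-X is only valid right after B-X or I-X.
    return label_from in ("B-" + entity, "I-" + entity)


for pair in [("B-geo", "I-geo"), ("O", "I-geo"), ("B-org", "I-per")]:
    print(pair, is_valid_bio_transition(*pair))
```

Transitions for which this function returns False are the ones we would expect to see with large negative weights.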
To make the output easier to read, we can check only a subset of tags.
Or check only some of the features for all tags.
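In ELI5, `show_weights` accepts a `targets` argument to restrict the tags and a `feature_re` argument to restrict the features by regular expression. The sketch below mimics both filters on a plain dict shaped like `crf.state_features_`, so it runs without ELI5 or a trained model; the helper name is hypothetical and the sample weights are the ones quoted earlier in the text.

```python
import re


def filter_state_features(state_features, targets=None, feature_re=None):
    """Keep only state features whose label is in targets and whose name matches feature_re."""
    pattern = re.compile(feature_re) if feature_re else None
    return {
        (attr, label): weight
        for (attr, label), weight in state_features.items()
        if (targets is None or label in targets)
        and (pattern is None or pattern.search(attr))
    }


sample_state_features = {
    ("word.lower():president", "B-per"): 3.370614,
    ("postag:NNP", "O"): -3.521244,
    ("word.isdigit()", "O"): -3.087828,
    ("word.istitle()", "O"): -3.233526,
}

# Only one tag.
per_only = filter_state_features(sample_state_features, targets=["B-per"])
# Only the word.is*() flag features, for all tags.
flags_only = filter_state_features(sample_state_features, feature_re=r"^word\.is")
print(per_only)
print(flags_only)
```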
That's it for now. I enjoyed getting my hands dirty with sklearn-crfsuite and ELI5, and I hope you did too. The source code can be found on GitHub. Have a great week!
Bio: Susan Li is changing the world, one article at a time. She is a Sr. Data Scientist, located in Toronto, Canada.
Original. Reposted with permission.