Named Entity Recognition and Classification with Scikit-Learn
By Susan Li, Sr. Data Scientist
Named Entity Recognition and Classification (NERC) is the process of recognizing information units in unstructured text: names (of persons, organizations and locations) and numeric expressions (time, date, money and percent expressions). The goal is to develop practical, domain-independent techniques that detect named entities automatically and with high accuracy.
Last week, we gave an introduction on Named Entity Recognition (NER) in NLTK and SpaCy. Today, we go a step further, training machine learning models for NER using some of Scikit-Learn’s libraries. Let’s get started!
Essential information about the entity tags:
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
IOB (short for inside, outside, beginning) is a common format for tagging tokens.
- An I- prefix before a tag indicates that the token is inside a chunk.
- A B- prefix before a tag indicates that the token is the beginning of a chunk.
- An O tag indicates that a token belongs to no chunk (outside).
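To make the tagging scheme concrete, here is a minimal sketch that groups IOB-tagged tokens back into entity chunks. The helper name `iob_to_spans` and the example sentence are our own illustration, not part of any library or of the data set.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) chunks."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_chunk = tag.startswith("B-") or (
            # An I- tag that does not continue the open chunk is treated
            # as the start of a new one (a lenient choice for this sketch).
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_chunk:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": the token belongs to no chunk
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence, tagged by hand.
tokens = ["Tony", "Blair", "visited", "London"]
tags = ["B-per", "I-per", "O", "B-geo"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

Here "Tony" opens a person chunk, "Blair" continues it, "visited" is outside any chunk, and "London" opens a geographical chunk.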
The entire data set cannot fit into the memory of a single computer, so we select the first 100,000 records and use out-of-core learning algorithms to fetch and process the data efficiently.
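As a rough sketch of how the first records of a large CSV file can be selected with pandas, the example below reads from a tiny in-memory stand-in instead of the real file. The column names other than ‘Sentence #’ and the sample rows are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV file. The "Word", "POS" and
# "Tag" column names and all rows are assumptions for this illustration.
csv_text = """Sentence #,Word,POS,Tag
Sentence: 1,Tony,NNP,B-per
,Blair,NNP,I-per
,visited,VBD,O
,London,NNP,B-geo
Sentence: 2,He,PRP,O
,left,VBD,O
"""

# Keep only the first N records (we use N = 100,000 on the real data).
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=4)

# Alternatively, stream the file in fixed-size chunks so that only one
# chunk is held in memory at a time.
chunk_sizes = [
    len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)
]
print(len(first_rows), chunk_sizes)
```

Reading in chunks pairs naturally with estimators that can be trained incrementally, one chunk at a time.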
We notice that there are many NaN values in the ‘Sentence #’ column, so we fill each NaN with the preceding value.
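Filling each NaN with the preceding value is a forward fill. Here is a minimal sketch on a toy frame; the "Word" and "Tag" columns and their values are assumed for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: only the first token of each sentence carries the sentence id.
df = pd.DataFrame({
    "Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan],
    "Word": ["Tony", "Blair", "spoke", "London", "listened"],
    "Tag": ["B-per", "I-per", "O", "B-geo", "O"],
})

# Forward fill: propagate the last seen value down into the NaN cells.
df = df.ffill()
print(df)
```

After the fill, every token row carries the id of the sentence it belongs to.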
(4544, 10922, 17)
We have 4,544 sentences that contain 10,922 unique words and are tagged with 17 tags.
The tags are not evenly distributed.
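One way counts like these and the tag distribution could be computed is sketched below on toy data. The column names "Word" and "Tag" are assumptions, and the numbers it prints describe the toy frame, not the real data set.

```python
import pandas as pd

# Toy frame with the sentence ids already forward-filled.
df = pd.DataFrame({
    "Sentence #": ["Sentence: 1"] * 4 + ["Sentence: 2"] * 2,
    "Word": ["Tony", "Blair", "visited", "London", "He", "left"],
    "Tag": ["B-per", "I-per", "O", "B-geo", "O", "O"],
})

# Number of distinct sentences, words and tags.
counts = (
    df["Sentence #"].nunique(),
    df["Word"].nunique(),
    df["Tag"].nunique(),
)

# Frequency of each tag, most common first.
tag_counts = df["Tag"].value_counts()
print(counts)
print(tag_counts)
```

Even in this tiny example, “O” occurs more often than any entity tag.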
We transform the text data into vectors using DictVectorizer and then split the result into train and test sets.
((67000, 15507), (67000,))
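A minimal sketch of this step on toy records follows. The feature names "Word" and "POS" are assumptions, and so is `test_size=0.33`, although 67,000 training rows out of 100,000 records is consistent with holding out roughly a third.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# One dict of features per token (feature names are assumptions).
records = [
    {"Word": "Tony", "POS": "NNP"},
    {"Word": "Blair", "POS": "NNP"},
    {"Word": "visited", "POS": "VBD"},
    {"Word": "London", "POS": "NNP"},
    {"Word": "He", "POS": "PRP"},
    {"Word": "left", "POS": "VBD"},
]
tags = ["B-per", "I-per", "O", "B-geo", "O", "O"]

# DictVectorizer one-hot encodes string-valued features: one column per
# distinct (feature, value) pair.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

# Hold out about a third of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, tags, test_size=0.33, random_state=0
)
print(X.shape, X_train.shape, len(y_train))
```

Each distinct word and each distinct part-of-speech value becomes its own column, which is why the real feature matrix has thousands of columns.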
We will try some of the out-of-core algorithms that support the partial_fit method. They are designed to process data that is too large to fit into a single computer’s memory.
Tag “O” (outside) is the most common tag, and including it would make our results look much better than they actually are. So we remove tag “O” when we evaluate classification metrics.
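The effect of dropping “O” can be sketched with scikit-learn's `labels` argument, which restricts the metrics to the listed classes. The true and predicted tags below are made up for illustration.

```python
from sklearn.metrics import classification_report, f1_score

# Made-up gold tags and predictions; one entity token is missed.
y_true = ["O", "O", "B-per", "I-per", "O", "B-geo", "O", "O"]
y_pred = ["O", "O", "B-per", "O", "O", "B-geo", "O", "O"]

classes = sorted(set(y_true))
entity_labels = [c for c in classes if c != "O"]  # every tag except "O"

# Weighted F1 with and without the dominant "O" class.
f1_with_o = f1_score(
    y_true, y_pred, labels=classes, average="weighted", zero_division=0
)
f1_without_o = f1_score(
    y_true, y_pred, labels=entity_labels, average="weighted", zero_division=0
)
print(f1_with_o, f1_without_o)

# Per-class report restricted to the entity tags.
print(classification_report(
    y_true, y_pred, labels=entity_labels, zero_division=0
))
```

Because most tokens are “O” and are easy to get right, the score that includes “O” is higher than the score computed on the entity tags alone.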
Linear classifiers with SGD training
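Here is a minimal sketch of incremental training with `SGDClassifier.partial_fit` on toy data. The records, the batch size and `random_state` are assumptions for illustration, not the settings behind any reported result.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier

# Toy token features and tags (assumed for illustration).
records = [
    {"Word": "Tony", "POS": "NNP"},
    {"Word": "Blair", "POS": "NNP"},
    {"Word": "visited", "POS": "VBD"},
    {"Word": "London", "POS": "NNP"},
    {"Word": "He", "POS": "PRP"},
    {"Word": "left", "POS": "VBD"},
]
tags = np.array(["B-per", "I-per", "O", "B-geo", "O", "O"])

X = DictVectorizer(sparse=False).fit_transform(records)
# partial_fit needs the full list of classes up front, because a single
# batch may not contain every tag.
classes = np.unique(tags)

sgd = SGDClassifier(random_state=0)
batch_size = 2
for start in range(0, len(tags), batch_size):
    stop = start + batch_size
    # Update the model with one mini-batch at a time.
    sgd.partial_fit(X[start:stop], tags[start:stop], classes=classes)

pred = sgd.predict(X)
print(pred)
```

On real data, each mini-batch would come from a chunk read from disk, so the full feature matrix never has to be held in memory.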
Naive Bayes classifier for multinomial models
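`MultinomialNB` also supports `partial_fit`. It expects non-negative features, which the one-hot output of DictVectorizer satisfies. The sketch below uses toy data and default hyperparameters, both assumed for illustration.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy token features and tags (assumed for illustration).
records = [
    {"Word": "Tony", "POS": "NNP"},
    {"Word": "Blair", "POS": "NNP"},
    {"Word": "visited", "POS": "VBD"},
    {"Word": "London", "POS": "NNP"},
]
tags = np.array(["B-per", "I-per", "O", "B-geo"])

# One-hot features are 0/1, so they are valid input for MultinomialNB.
X = DictVectorizer(sparse=False).fit_transform(records)
classes = np.unique(tags)

nb = MultinomialNB()
# Feed the data in two batches; classes are declared on the first call.
nb.partial_fit(X[:2], tags[:2], classes=classes)
nb.partial_fit(X[2:], tags[2:])

nb_pred = nb.predict(X)
print(nb_pred)
```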
Passive Aggressive Classifier
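The same incremental pattern applies to `PassiveAggressiveClassifier`. This is a sketch on toy data with default hyperparameters, all assumed for illustration.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Toy token features and tags (assumed for illustration).
records = [
    {"Word": "Tony", "POS": "NNP"},
    {"Word": "Blair", "POS": "NNP"},
    {"Word": "visited", "POS": "VBD"},
    {"Word": "London", "POS": "NNP"},
]
tags = np.array(["B-per", "I-per", "O", "B-geo"])

X = DictVectorizer(sparse=False).fit_transform(records)
classes = np.unique(tags)

pa = PassiveAggressiveClassifier(random_state=0)
# One incremental update; more batches can be passed in later calls.
pa.partial_fit(X, tags, classes=classes)

pa_pred = pa.predict(X)
print(pa_pred)
```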
None of the above classifiers produced satisfactory results. Clearly, classifying named entities with regular classifiers is not going to be easy.