Text Preprocessing in Python: Steps, Tools, and Examples
We outline the basic steps of text preprocessing, which are needed for transferring text from human language to machine-readable format for further processing. We will also discuss text preprocessing tools.
Example 12. Named-entity recognition using NLTK:
Code:
from nltk import word_tokenize, pos_tag, ne_chunk input_str = “Bill works for Apple so he went to Boston for a conference.” print ne_chunk(pos_tag(word_tokenize(input_str)))
Output:
(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)
Coreference resolution (anaphora resolution)
Pronouns and other referring expressions should be connected to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence, “Andrew said he would buy a car” the pronoun “he” refers to the same person, namely to “Andrew”. Coreference resolution tools: Stanford CoreNLP, spaCy, Open Calais, Apache OpenNLP are described in the “Coreference resolution” sheet of the table.
Name, Developer, Initial release |
Features |
Programming languages |
License |
Incorporates a variety of machine learning approaches |
REST-based web service |
||
Uses several machine learning toolkits |
|||
Resolves third person pronouns, lexical anaphors, Identifies pleonastic pronouns |
Java |
- |
|
Output format: anaphor - antecedent pairs; text with in-place substitutions |
|||
Accuracy: 57.9% (MUC6) |
|||
A General Tool for Anaphora Resolution - GuiTAR, |
Takes as an input a MAS-XML compliant file and adds new markup holding anaphoric information (elements). |
Java |
|
Reconcile, |
Runs on common data sets or unlabeled texts |
Java |
|
Utilizes supervised machine learning classifiers from the Weka toolkit, the Berkeley Parser and Stanford Named Entity Recognition system. |
|||
MUC Score (MUC-6): recall 67.23; precision 65.54; F-measure 66.38 |
|||
Deterministic, rule-based system |
Java |
||
Uses syntactic information from a constituent parser, and semantic information from an entity recognition component |
|||
Coreference related features: |
Java |
- |
|
Gender and number match, WordNet relations including synonym, hypernym, antonym, and ACE entity types (person, organization, and geopolitical entity) |
|||
Uses neural nets and spaCy |
Python |
||
coreference resolution toolkit (cort), |
Coreference resolution component |
Python |
|
Error analysis component |
|||
Framework is based on latent variables allowing user to rapidly devise approaches to coreference resolution |
|||
Analyzes and visualizes errors made by coreference resolution systems [43] |
|||
CherryPicker, |
Runs on Unix/Linux |
- |
Free for educational and research activities; may not be used for commercial or for-profit purposes |
FreeLing, |
Provides language analysis functionalities |
C++ |
|
Supports a variety of languages |
|||
Provides a command-line front-end |
|||
Highly configurable, language independent coreferencer |
Python |
||
Pluggable classifiers – models can work 100% rule-based or use classifiers to rank rule outputs [46] |
|||
|
Coreference resolution tools
An example of coreference resolution using xrenner can be found here.
Collocation extraction
Collocations are word combinations occurring together more often than would be expected by chance. Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on.
Name, Developer, Initial release |
Features |
Programming languages |
License |
TermeX, |
UTF-8 formatted input text |
Freely available for research purposes upon request. |
|
uses 14 association measures |
|||
Processes of n-grams up to length four |
|||
Hand selection of candidate n-grams for terminology lexica |
|||
Windows and Linux support |
|||
Fast and memory efficient processing of large corpora |
|||
Uses statistical analyses (t-score, log likelihood, Mutual Information) and frequency information to present a list of candidate collocations |
Application |
||
Searches for a word (phrase) within a set span (e.g. 4 words). |
|||
Produces an n-gram list |
|
||
Extracts collocations using thresholds and either mutual information [48] |
|
||
Results based on five different co-occurrence measures for multiword units (i.e. collocations) or distributional differences from large representative corpus by application of the TF-IDF measurement on single word units |
Python |
||
Collocation words features in this approach: |
Restrictions: Inform Licensor, No Redistribution |
||
– the distance between them is relatively constant; |
User Nature: Academic, Commercial |
||
– they appear together more often than expected by chance (Log-Likelihood) |
|
||
Language independent |
|
||
|
|||
Idioms and collocations extraction |
Python |
||
Two user-friendly formats |
|||
Text::NSP, |
Extract collocations and N-grams from text |
Perl |
|
Text::NSP::Measures module evaluates whether the co-occurrence of the words in an N-gram is purely by chance or statistically significant. |
|||
Collocation extraction tools
Example 13. Collocation extraction using ICE [51]
Code:
input=[“he and Chazz duel with all keys on the line.”] from ICE import CollocationExtractor extractor = CollocationExtractor.with_collocation_pipeline(“T1” , bing_key = “Temp”,pos_check = False) print(extractor.get_collocations_of_length(input, length = 3))
Output:
[“on the line”]
Relationship extraction
Relationship extraction allows obtaining structured information from unstructured sources such as raw text. Strictly stated, it is identifying relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence “Mark and Emily married yesterday,” we can extract the information that Mark is Emily’s husband.
Name, Developer, Initial release |
Features |
Programming languages |
License |
Automatically identifies and extracts binary relationships from English sentences |
Java |
||
Designed for Web-scale information extraction where the target relations cannot be specified in advance and speed is important. |
|||
Inputs raw text |
|||
Outputs (argument1, relation phrase, argument2) triples [53] |
|||
Able to identify instances of any relation described in the text |
Java |
||
Extracts relations with two or more arguments |
|||
Role of an argument can be SUBJ (subject), DOBJ (direct object) and POBJ (prepositional object) [54] |
|||
Toolkit for Exploring Text for Relation Extraction (TETRE), |
Accepts raw text as an input |
Command line tool |
|
Optimized for the task of information extraction in a corpus composed of academic papers |
|||
Does data transformation, parsing, wraps tasks of third-party binaries |
|||
Uses state-of-the-art NLP and AI techniques |
Python |
||
Processes thousands of words per second per core |
PHP |
||
Java |
|||
|
REST API |
||
Tries to predict relations using information provided by the user |
Python |
||
Aimed to perform Information Extraction (IE) on a large dataset |
|||
created for scientific experiments with new IE algorithms |
|||
Recognizes when two entities are related, and identify the type of relation |
Curl |
||
Node |
|||
|
Java |
||
|
Python |
||
Binary relation detection |
C, C++, Java, R, Python |
Boost Software License |
|
tools for training custom extractors and relation detectors |
|||
Uses of distributional word embeddings and structural Support Vector Machines |
|||
Offers several pre-trained models |
|||
An example of relationship extraction using NLTK can be found here.
Summary
In this post, we talked about text preprocessing and described its main steps including normalization, tokenization, stemming, lemmatization, chunking, part of speech tagging, named-entity recognition, coreference resolution, collocation extraction, and relationship extraction. We also discussed text preprocessing tools and examples. A comparative table was created.
After the text preprocessing is done, the result may be used for more complicated NLP tasks, for example, machine translation or natural language generation.
Resources:
- http://www.nltk.org/index.html
- http://textblob.readthedocs.io/en/dev/
- https://spacy.io/usage/facts-figures
- https://radimrehurek.com/gensim/index.html
- https://opennlp.apache.org/
- http://opennmt.net/
- https://gate.ac.uk/
- https://uima.apache.org/
- https://www.clips.uantwerpen.be/pages/MBSP#tokenizer
- https://rapidminer.com/
- http://mallet.cs.umass.edu/
- https://www.clips.uantwerpen.be/pages/pattern
- https://nlp.stanford.edu/software/tokenizer.html#About
- https://tartarus.org/martin/PorterStemmer/
- http://www.nltk.org/api/nltk.stem.html
- https://snowballstem.org/
- https://pypi.python.org/pypi/PyStemmer/1.0.1
- https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html
- https://lucene.apache.org/core/
- https://dkpro.github.io/dkpro-core/
- http://ucrel.lancs.ac.uk/claws/
- http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
- https://en.wikipedia.org/wiki/Shallow_parsing
- https://cogcomp.org/page/software_view/Chunker
- https://github.com/dstl/baleen
- https://github.com/CogComp/cogcomp-nlp/tree/master/ner
- https://github.com/lasigeBioTM/MER
- https://blog.paralleldots.com/product/dig-relevant-text-elements-entity-extraction-api/
- http://www.opencalais.com/about-open-calais/
- http://alias-i.com/lingpipe/index.html
- https://github.com/glample/tagger
- http://minorthird.sourceforge.net/old/doc/
- https://www.ibm.com/support/knowledgecenter/en/SS8NLW_10.0.0/com.ibm.watson.wex.aac.doc/aac-tasystemt.html
- https://www.poolparty.biz/
- https://www.basistech.com/text-analytics/rosette/entity-extractor/
- http://www.bart-coref.org/index.html
- https://wing.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
- http://cswww.essex.ac.uk/Research/nle/GuiTAR/
- https://www.cs.utah.edu/nlp/reconcile/
- https://github.com/brendano/arkref
- https://cogcomp.org/page/software_view/Coref
- https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30
- https://github.com/smartschat/cort
- http://www.hlt.utdallas.edu/~altaf/cherrypicker/
- http://nlp.lsi.upc.edu/freeling/
- https://corpling.uis.georgetown.edu/xrenner/#
- http://takelab.fer.hr/termex_s/
- https://www.athel.com/colloc.html
- http://linghub.lider-project.eu/metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75
- http://ws.racai.ro:9191/narratives/batch2/Colloc.pdf
- http://www.aclweb.org/anthology/E17-3027
- https://metacpan.org/pod/Text::NSP
- https://github.com/knowitall/reverb
- https://github.com/U-Alberta/exemplar
- https://github.com/aoldoni/tetre
- https://www.textrazor.com/technology
- https://github.com/machinalis/iepy
- https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#relations
- https://github.com/mit-nlp/MITIE
Bio: Data Monsters help corporations and funded startups research, design, and develop real-time intelligent software to improve their business with data technologies.
Original. Reposted with permission.
Related:
- Data Representation for Natural Language Processing Tasks
- Self-Service Data Prep Tools vs Enterprise-Level Solutions? 6 Lessons Learned
- Introduction to Active Learning