Text Preprocessing in Python: Steps, Tools, and Examples

We outline the basic steps of text preprocessing, which are needed to convert text from human language into a machine-readable format for further processing. We also discuss text preprocessing tools.

Example 12. Named-entity recognition using NLTK:

Code:

 
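from nltk import word_tokenize, pos_tag, ne_chunk
# Requires the NLTK tokenizer, tagger, and NE chunker data (install via nltk.download()).
input_str = "Bill works for Apple so he went to Boston for a conference."
# Tokenize, tag parts of speech, then chunk named entities.
print(ne_chunk(pos_tag(word_tokenize(input_str))))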

Output:

 
(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)
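The bracketed output above can be reduced to a plain list of entities. The following is a minimal sketch that works on the printed string only, using a regular expression from the standard library; the function name entities_from_bracketed is our own, and in practice you would more likely walk the tree object returned by ne_chunk directly.

```python
import re

def entities_from_bracketed(tree_str):
    """Return (entity text, label) pairs from a printed NLTK chunk tree.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    pairs = []
    # Innermost "(LABEL word/TAG ...)" groups are the entity chunks.
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", tree_str):
        if label == "S":
            # A sentence without entities would match as a whole; skip it.
            continue
        # Drop the part-of-speech tag that follows the last "/" in each token.
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        pairs.append((" ".join(words), label))
    return pairs

output = ("(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP "
          "went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)")
print(entities_from_bracketed(output))
```

Note that "Apple" is not listed, because the chunker did not mark it as an entity in the output above.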

Coreference resolution (anaphora resolution)

Pronouns and other referring expressions should be connected to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence “Andrew said he would buy a car,” the pronoun “he” refers to the same person as “Andrew.” The coreference resolution tools Stanford CoreNLP, spaCy, Open Calais, and Apache OpenNLP are described in the “Coreference resolution” sheet of the table.

Name, Developer, Initial release	Features	Programming languages	License
Beautiful Anaphora Resolution Toolkit (BART), Massimo Poesio, Simone Ponzetto, Yannick Versley, Johns Hopkins Summer Workshop, 2007	Incorporates a variety of machine learning approaches	REST-based web service	Apache license v2.0
	Uses several machine learning toolkits
	Exports the result as inline XML [36]
JavaRAP, Long Qiu, 2004	Resolves third-person pronouns and lexical anaphors; identifies pleonastic pronouns	Java	-
	Output format: anaphor - antecedent pairs; text with in-place substitutions
	Accuracy: 57.9% (MUC6)
	Speed: around 1,500 words per second [37]
A General Tool for Anaphora Resolution - GuiTAR, University of Essex, 2007	Takes as an input a MAS-XML compliant file and adds new markup holding anaphoric information (elements).	Java	GPL License
	Evaluation module is also included [38]
Reconcile, Cornell University, The University of Utah, Lawrence Livermore National Labs, 2009	Runs on common data sets or unlabeled texts	Java	GPL License
	Utilizes supervised machine learning classifiers from the Weka toolkit, the Berkeley Parser and Stanford Named Entity Recognition system.
	MUC Score (MUC-6): recall 67.23; precision 65.54; F-measure 66.38
	[39]
ARKref, Brendan O'Connor, Michael Heilman, 2009	Deterministic, rule-based system	Java	GPL, MIT
	Uses syntactic information from a constituent parser, and semantic information from an entity recognition component
	Precision: 0.657617, recall: 0.552433, f1: 0.600454 [40]
Illinois Coreference Package, Dan Roth, Eric Bengtson, 2008	Coreference related features:	Java	-
	Gender and number match, WordNet relations including synonym, hypernym, antonym, and ACE entity types (person, organization, and geopolitical entity)
	Features anaphoricity classifier [41]
Neural coref, Hugging Face, 2017	Uses neural nets and spaCy	Python	MIT License
	Adds various speakers in the conversation when computing the features and resolving the coreferences [42]
	Online demo
coreference resolution toolkit (cort), Sebastian Martschat, Thierry Goeckel, Patrick Claus	Coreference resolution component	Python	MIT License
	Error analysis component
	Framework is based on latent variables allowing user to rapidly devise approaches to coreference resolution
	Analyzes and visualizes errors made by coreference resolution systems [43]
CherryPicker, Altaf Rahman, Vincent Ng, University of Texas at Dallas, 2009	Runs on Unix/Linux	-	Free for educational and research activities; may not be used for commercial or for-profit purposes
	Cluster-ranking coreference model [44]
FreeLing, TALP Research Center, Universitat Politècnica de Catalunya	Provides language analysis functionalities	C++	Affero GNU General Public License
	Supports a variety of languages
	Provides a command-line front-end
	Output formats: XML, JSON, CoNLL [45]
eXternally configurable REference and Non Named Entity Recognizer (xrenner), Zeldes, Amir and Zhang, Shuo, Department of Linguistics at Georgetown University, 2016	Highly configurable, language independent coreferencer	Python	Apache License, Version 2.0
	Pluggable classifiers – models can work 100% rule-based or use classifiers to rank rule outputs [46]

Coreference resolution tools

An example of coreference resolution using xrenner can be found here.
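To make the task concrete without installing any of the tools above, here is a deliberately naive sketch. It links each pronoun to the most recent capitalized word before it and returns anaphor-antecedent pairs. The pronoun list and the function name resolve_pronouns are our own assumptions; none of the listed tools works this way, and the heuristic ignores gender, number, syntax, and sentence-initial capitalization.

```python
import re

# Small, illustrative pronoun list (an assumption, not exhaustive).
PRONOUNS = {"he", "she", "him", "her", "his", "hers"}

def resolve_pronouns(text):
    """Pair each pronoun with the nearest preceding capitalized word.

    Toy heuristic for illustration only.
    """
    pairs = []
    last_name = None
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in PRONOUNS:
            if last_name is not None:
                pairs.append((token, last_name))
        elif token[0].isupper():
            # Treat any other capitalized word as a candidate antecedent.
            last_name = token
    return pairs

print(resolve_pronouns("Andrew said he would buy a car"))
```

The limits show up quickly: for the sentence from Example 12, this heuristic links “he” to “Apple” rather than “Bill,” which is exactly the kind of error that the syntactic and semantic features used by the tools in the table are meant to prevent.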

Collocation extraction

Collocations are word combinations occurring together more often than would be expected by chance. Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on.

Name, Developer, Initial release	Features	Programming languages	License
TermeX, Text Analysis and Knowledge Engineering Lab, University of Zagreb, 2009	UTF-8 formatted input text	Front end GUI	Freely available for research purposes upon request.
	Uses 14 association measures
	Processes n-grams up to length four
	Hand selection of candidate n-grams for terminology lexica
	Windows and Linux support
	Fast and memory efficient processing of large corpora
	[47]
Collocate, Athelstan	Uses statistical analyses (t-score, log likelihood, Mutual Information) and frequency information to present a list of candidate collocations	Application	Educational Price. Single user: $45
	Searches for a word (phrase) within a set span (e.g. 4 words).		Site license (2-year, 15 users) $395
	Produces an n-gram list
	Extracts collocations using thresholds and either mutual information [48]
CollTerm, Faculty of Humanities and Social Sciences at University of Zagreb; Research Institute for Artificial Intelligence at Romanian Academy	Results are based on five co-occurrence measures for multiword units (i.e. collocations), or on distributional differences from a large representative corpus, measured with TF-IDF, for single-word units	Python	Apache License 2.0
	Language independent [49]
Collocation Extractor, Dan Ștefănescu, 2012	Collocation words features in this approach:	Application	Restrictions: Inform Licensor, No Redistribution
	– the distance between them is relatively constant;		User Nature: Academic, Commercial
	– they appear together more often than expected by chance (Log-Likelihood)
	Language independent
	Output annotation format: text with one collocation per line and annotations separated by tabs [50]
ICE: Idiom and Collocation Extractor, Verizon Labs, Computer Science Dept. University of Houston, 2017	Idioms and collocations extraction	Python	Apache License 2.0
	Two user-friendly formats for identifying collocations, offline and online
	Uses dictionary search, web search and substitution, and web search independence [51]
Text::NSP, University of Minnesota, Carnegie Mellon University, University of Pittsburgh, 2000	Extract collocations and N-grams from text	Perl	GNU General Public License
	Text::NSP::Measures module evaluates whether the co-occurrence of the words in an N-gram is purely by chance or statistically significant.
	[52]

Collocation extraction tools

Example 13. Collocation extraction using ICE [51]

Code:

 
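from ICE import CollocationExtractor
# Named "sentences" rather than "input" to avoid shadowing the Python built-in.
sentences = ["he and Chazz duel with all keys on the line."]
# "Temp" is a placeholder for the Bing API key.
extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)
print(extractor.get_collocations_of_length(sentences, length=3))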

Output:

 
["on the line"]
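The idea of words “occurring together more often than would be expected by chance” can also be illustrated with pointwise mutual information (PMI), one of the association measures mentioned in the table. The sketch below uses only the standard library; the function name bigram_pmi, the min_count threshold, and the toy token list are our own assumptions, and real tools add smoothing, filtering, and much larger corpora.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ); a positive score means the
    pair occurs more often than independence would predict.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            # Rare pairs give unreliable PMI estimates, so skip them.
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring candidate collocations first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

tokens = ("i keep in mind that free time is free time "
          "and i keep in mind the rules").split()
for pair, score in bigram_pmi(tokens):
    print(pair, round(score, 2))
```

On a corpus this small the scores are not meaningful in themselves; the point is only that repeated pairs such as “free time” and “in mind” surface as candidates.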

Relationship extraction

Relationship extraction obtains structured information from unstructured sources such as raw text. More precisely, it identifies relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence “Mark and Emily married yesterday,” we can extract the information that Mark is Emily’s husband.

Name, Developer, Initial release	Features	Programming languages	License
ReVerb, University of Washington's Turing Center	Automatically identifies and extracts binary relationships from English sentences	Java	ReVerb Software License Agreement
	Designed for Web-scale information extraction where the target relations cannot be specified in advance and speed is important.
	Inputs raw text
	Outputs (argument1, relation phrase, argument2) triples [53]
EXEMPLAR, University of Alberta, 2013	Able to identify instances of any relation described in the text	Java	GNU General Public License v3.0
	Extracts relations with two or more arguments
	Role of an argument can be SUBJ (subject), DOBJ (direct object) and POBJ (prepositional object) [54]
Toolkit for Exploring Text for Relation Extraction (TETRE), Alisson Oldoni, 2017	Accepts raw text as an input	Command line tool	MIT License
	Optimized for the task of information extraction in a corpus composed of academic papers
	Does data transformation, parsing, wraps tasks of third-party binaries
	Outputs the relations in HTML and JSON [55]
TextRazor, TextRazor, 2011	Uses state-of-the-art NLP and AI techniques	Python	Pricing
	Processes thousands of words per second per core	PHP
	Allows user to add product names, people, companies, custom classification rules and advanced linguistic patterns [56]	Java
		REST API
Information Extraction in Python (IEPY), Machinalis, 2014	Tries to predict relations using information provided by the user	Python	BSD 3-Clause "New" or "Revised" License
	Aimed to perform Information Extraction (IE) on a large dataset
	Created for scientific experiments with new IE algorithms
	Configured with convenient defaults [57]
Watson Natural Language Understanding, IBM	Recognizes when two entities are related and identifies the type of relation	Curl	Pricing
	Supported languages: Arabic, English, Korean, Spanish [58]	Node
		Java
		Python
MIT Information Extraction (MITIE), E. Davis King, 2009	Binary relation detection	C, C++, Java, R, Python	Boost Software License
	Tools for training custom extractors and relation detectors
	Uses distributional word embeddings and structural Support Vector Machines
	Offers several pre-trained models
	Supports English, Spanish, and German [59]

An example of relationship extraction using NLTK can be found here.
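As a rough illustration of the output such tools produce, the sketch below extracts (argument1, relation, argument2) triples with a few hand-written regular expressions. The patterns, the relation labels, and the function name extract_relations are hypothetical and chosen to match the example sentences in this post; the tools in the table rely on parsing and machine learning rather than fixed patterns like these.

```python
import re

# Hand-written surface patterns (assumptions for illustration only).
# Each pattern captures two capitalized words as the relation arguments.
PATTERNS = [
    (re.compile(r"\b([A-Z]\w+) and ([A-Z]\w+) married\b"), "spouse"),
    (re.compile(r"\b([A-Z]\w+) works for ([A-Z]\w+)\b"), "employment"),
    (re.compile(r"\b([A-Z]\w+) acquired ([A-Z]\w+)\b"), "acquisition"),
]

def extract_relations(text):
    """Return (argument1, relation, argument2) triples found in the text."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group(1), relation, match.group(2)))
    return triples

print(extract_relations("Mark and Emily married yesterday."))
print(extract_relations("Bill works for Apple so he went to Boston for a conference."))
```

Fixed patterns break as soon as the wording changes (for example, “Emily became Mark’s wife”), which is why practical systems combine named-entity recognition with syntactic analysis or trained classifiers.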

Summary

In this post, we talked about text preprocessing and described its main steps, including normalization, tokenization, stemming, lemmatization, chunking, part-of-speech tagging, named-entity recognition, coreference resolution, collocation extraction, and relationship extraction. We also discussed text preprocessing tools and examples, and compiled a comparative table of those tools.

After the text preprocessing is done, the result may be used for more complicated NLP tasks, for example, machine translation or natural language generation.

Resources:

Bio: Data Monsters help corporations and funded startups research, design, and develop real-time intelligent software to improve their business with data technologies.

Original. Reposted with permission.
