Text Preprocessing in Python: Steps, Tools, and Examples

We outline the basic steps of text preprocessing, which are needed to convert text from human language into a machine-readable format for further processing. We also discuss text preprocessing tools.



Example 12. Named-entity recognition using NLTK:

Code:

 
from nltk import word_tokenize, pos_tag, ne_chunk

input_str = "Bill works for Apple so he went to Boston for a conference."
# Tokenize, tag parts of speech, then chunk named entities into a tree
print(ne_chunk(pos_tag(word_tokenize(input_str))))

Output:

 
(S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)

Coreference resolution (anaphora resolution)

Pronouns and other referring expressions should be connected to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence “Andrew said he would buy a car,” the pronoun “he” refers to the same person as “Andrew.” The coreference resolution tools Stanford CoreNLP, spaCy, Open Calais, and Apache OpenNLP are described in the “Coreference resolution” sheet of the table.
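
To make the idea concrete, here is a toy sketch that links each pronoun to the nearest preceding capitalized word. This heuristic is our own simplification for illustration and is not how any of the tools discussed here work. It ignores gender, number, and syntax, and it would wrongly treat any capitalized sentence-initial word as a name.

```python
import re

# A small, illustrative set of third-person pronouns.
PRONOUNS = {"he", "she", "him", "her", "his", "hers"}


def resolve_pronouns(sentence):
    """Link each pronoun to the nearest preceding capitalized word (toy heuristic)."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    links = []
    last_name = None
    for token in tokens:
        if token.lower() in PRONOUNS:
            # Only link when a candidate antecedent has already been seen.
            if last_name is not None:
                links.append((token, last_name))
        elif token[0].isupper():
            # Treat any capitalized non-pronoun as a candidate antecedent.
            last_name = token
    return links


print(resolve_pronouns("Andrew said he would buy a car"))
# [('he', 'Andrew')]
```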

| Name, developer, initial release | Features | Programming languages | License |
| --- | --- | --- | --- |
| Beautiful Anaphora Resolution Toolkit (BART); Massimo Poesio, Simone Ponzetto, Yannick Versley, Johns Hopkins Summer Workshop, 2007 | Incorporates a variety of machine learning approaches; uses several machine learning toolkits; exports the result as inline XML [36] | REST-based web service | Apache License v2.0 |
| JavaRAP; Long Qiu, 2004 | Resolves third-person pronouns and lexical anaphors; identifies pleonastic pronouns; output format: anaphor-antecedent pairs or text with in-place substitutions; accuracy: 57.9% (MUC-6); around 1,500 words per second [37] | Java | - |
| A General Tool for Anaphora Resolution (GuiTAR); University of Essex, 2007 | Takes a MAS-XML compliant file as input and adds new markup holding anaphoric information (elements); an evaluation module is also included [38] | Java | GPL License |
| Reconcile; Cornell University, The University of Utah, Lawrence Livermore National Labs, 2009 | Runs on common data sets or unlabeled texts; utilizes supervised machine learning classifiers from the Weka toolkit, the Berkeley Parser, and the Stanford Named Entity Recognition system; MUC score (MUC-6): recall 67.23, precision 65.54, F-measure 66.38 [39] | Java | GPL License |
| ARKref; Brendan O'Connor, Michael Heilman, 2009 | Deterministic, rule-based system; uses syntactic information from a constituent parser and semantic information from an entity recognition component; precision: 0.657617, recall: 0.552433, F1: 0.600454 [40] | Java | GPL, MIT |
| Illinois Coreference Package; Dan Roth, Eric Bengtson, 2008 | Coreference-related features: gender and number match, WordNet relations including synonym, hypernym, and antonym, and ACE entity types (person, organization, and geopolitical entity); features an anaphoricity classifier [41] | Java | - |
| Neural coref; Hugging Face, 2017 | Uses neural nets and spaCy; adds various speakers in the conversation when computing the features and resolving the coreferences [42]; online demo | Python | MIT License |
| coreference resolution toolkit (cort); Sebastian Martschat, Thierry Goeckel, Patrick Claus | Coreference resolution component; error analysis component; framework is based on latent variables, allowing the user to rapidly devise approaches to coreference resolution; analyzes and visualizes errors made by coreference resolution systems [43] | Python | MIT License |
| CherryPicker; Altaf Rahman, Vincent Ng, University of Texas at Dallas, 2009 | Runs on Unix/Linux; cluster-ranking coreference model [44] | - | Free for educational and research activities; may not be used for commercial or for-profit purposes |
| FreeLing; TALP Research Center, Universitat Politècnica de Catalunya | Provides language analysis functionalities; supports a variety of languages; provides a command-line front end; output formats: XML, JSON, CoNLL [45] | C++ | Affero GNU General Public License |
| eXternally configurable REference and Non Named Entity Recognizer (xrenner); Amir Zeldes and Shuo Zhang, Department of Linguistics at Georgetown University, 2016 | Highly configurable, language-independent coreferencer; pluggable classifiers: models can work 100% rule-based or use classifiers to rank rule outputs [46] | Python | Apache License, Version 2.0 |

 

Coreference resolution tools

An example of coreference resolution using xrenner can be found here.

Collocation extraction

Collocations are word combinations occurring together more often than would be expected by chance. Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on.
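
One common way to quantify "more often than would be expected by chance" is pointwise mutual information (PMI), which compares how often a word pair occurs with how often it would occur if the two words were independent. The snippet below is a minimal sketch using only the standard library. The function name pmi_bigrams, the min_count threshold, and the sample text are our own choices for illustration and do not reflect how any tool in the table below is implemented.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        # PMI > 0 means the pair co-occurs more often than independence predicts.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text = (
    "keep in mind that free time is rare so keep in mind "
    "that free time matters and time is free"
)
for pair, score in pmi_bigrams(text.split()):
    print(pair, round(score, 2))
```

On a corpus this small the scores are only illustrative. In practice, PMI is computed over large corpora, and a frequency threshold such as min_count keeps rare pairs from dominating the ranking.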

| Name, developer, initial release | Features | Programming languages | License |
| --- | --- | --- | --- |
| TermeX; Text Analysis and Knowledge Engineering Lab, University of Zagreb, 2009 | UTF-8 formatted input text; uses 14 association measures; processes n-grams up to length four; hand selection of candidate n-grams for terminology lexica; Windows and Linux support; fast and memory-efficient processing of large corpora [47] | Front-end GUI | Freely available for research purposes upon request |
| Collocate; Athelstan | Uses statistical analyses (t-score, log likelihood, mutual information) and frequency information to present a list of candidate collocations; searches for a word (phrase) within a set span (e.g. 4 words); produces an n-gram list; extracts collocations using thresholds and either mutual information [48] | Application | Educational price, single user: $45; site license (2-year, 15 users): $395 |
| CollTerm; Faculty of Humanities and Social Sciences at University of Zagreb; Research Institute for Artificial Intelligence at Romanian Academy | Results based on five different co-occurrence measures for multiword units (i.e. collocations) or on distributional differences from a large representative corpus, obtained by applying the TF-IDF measure to single-word units; language independent [49] | Python | Apache License 2.0 |
| Collocation Extractor; Dan Ștefănescu, 2012 | Treats words as a collocation when the distance between them is relatively constant and they appear together more often than expected by chance (log-likelihood); language independent; output annotation format: text output with one collocation per line and annotations separated by tabs [50] | Application | Restrictions: inform licensor, no redistribution; user nature: academic, commercial |
| ICE: Idiom and Collocation Extractor; Verizon Labs, Computer Science Dept. University of Houston, 2017 | Idiom and collocation extraction; two user-friendly formats for identifying collocations, offline and online; dictionary search, web search and substitution, and web search independence are used [51] | Python | Apache License 2.0 |
| Text::NSP; University of Minnesota, Carnegie Mellon University, University of Pittsburgh, 2000 | Extracts collocations and n-grams from text; the Text::NSP::Measures module evaluates whether the co-occurrence of the words in an n-gram is purely by chance or statistically significant [52] | Perl | GNU General Public License |

Collocation extraction tools

Example 13. Collocation extraction using ICE [51]

Code:

 
from ICE import CollocationExtractor

# Input is a list of sentences
sentences = ["he and Chazz duel with all keys on the line."]
extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)
# Request collocations that are three words long
print(extractor.get_collocations_of_length(sentences, length=3))

Output:

 
['on the line']

Relationship extraction

Relationship extraction obtains structured information from unstructured sources such as raw text. More precisely, it identifies relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence “Mark and Emily married yesterday,” we can extract the information that Mark is Emily’s husband.
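
The simplest form of this idea is pattern matching: each relation is tied to a surface pattern, and every match yields an (entity, relation, entity) triple. The sketch below uses hand-written regular expressions. The PATTERNS list, the relation labels, and the function name extract_relations are hypothetical and chosen only for illustration. Real systems rely on parsers, named-entity recognizers, and trained models instead of fixed patterns.

```python
import re

# Each hypothetical pattern maps a surface form to a relation label.
PATTERNS = [
    (re.compile(r"(?P<a>[A-Z]\w+) and (?P<b>[A-Z]\w+) married"), "spouse"),
    (re.compile(r"(?P<a>[A-Z]\w+) works for (?P<b>[A-Z]\w+)"), "employment"),
    (re.compile(r"(?P<a>[A-Z]\w+) acquired (?P<b>[A-Z]\w+)"), "acquisition"),
]


def extract_relations(text):
    """Return (entity, relation, entity) triples for every pattern match."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("a"), relation, match.group("b")))
    return triples


print(extract_relations("Mark and Emily married yesterday. Bill works for Apple."))
# [('Mark', 'spouse', 'Emily'), ('Bill', 'employment', 'Apple')]
```

Patterns like these are brittle: a small rewording such as "Emily became Mark's wife" is missed entirely.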

| Name, developer, initial release | Features | Programming languages | License |
| --- | --- | --- | --- |
| ReVerb; University of Washington's Turing Center | Automatically identifies and extracts binary relationships from English sentences; designed for Web-scale information extraction where the target relations cannot be specified in advance and speed is important; takes raw text as input; outputs (argument1, relation phrase, argument2) triples [53] | Java | ReVerb Software License Agreement |
| EXEMPLAR; University of Alberta, 2013 | Able to identify instances of any relation described in the text; extracts relations with two or more arguments; the role of an argument can be SUBJ (subject), DOBJ (direct object), or POBJ (prepositional object) [54] | Java | GNU General Public License v3.0 |
| Toolkit for Exploring Text for Relation Extraction (TETRE); Alisson Oldoni, 2017 | Accepts raw text as input; optimized for the task of information extraction in a corpus composed of academic papers; does data transformation and parsing, and wraps tasks of third-party binaries; outputs the relations in HTML and JSON [55] | Command-line tool | MIT License |
| TextRazor; TextRazor, 2011 | Uses state-of-the-art NLP and AI techniques; processes thousands of words per second per core; allows the user to add product names, people, companies, custom classification rules, and advanced linguistic patterns [56] | Python, PHP, Java, REST API | Pricing |
| Information Extraction in Python (IEPY); Machinalis, 2014 | Tries to predict relations using information provided by the user; aimed at performing Information Extraction (IE) on a large dataset; created for scientific experiments with new IE algorithms; configured with convenient defaults [57] | Python | BSD 3-Clause "New" or "Revised" License |
| Watson Natural Language Understanding; IBM | Recognizes when two entities are related and identifies the type of relation; supported languages: Arabic, English, Korean, Spanish [58] | Curl, Node, Java, Python | Pricing |
| MIT Information Extraction (MITIE); E. Davis King, 2009 | Binary relation detection; tools for training custom extractors and relation detectors; uses distributional word embeddings and structural Support Vector Machines; offers several pre-trained models; supports English, Spanish, and German [59] | C, C++, Java, R, Python | Boost Software License |

An example of relationship extraction using NLTK can be found here.

Summary

In this post, we described text preprocessing and its main steps, including normalization, tokenization, stemming, lemmatization, chunking, part-of-speech tagging, named-entity recognition, coreference resolution, collocation extraction, and relationship extraction. We also discussed text preprocessing tools, gave examples, and compiled a comparative table.

After the text preprocessing is done, the result may be used for more complicated NLP tasks, for example, machine translation or natural language generation.

Resources:

  1. http://www.nltk.org/index.html
  2. http://textblob.readthedocs.io/en/dev/
  3. https://spacy.io/usage/facts-figures
  4. https://radimrehurek.com/gensim/index.html
  5. https://opennlp.apache.org/
  6. http://opennmt.net/
  7. https://gate.ac.uk/
  8. https://uima.apache.org/
  9. https://www.clips.uantwerpen.be/pages/MBSP#tokenizer
  10. https://rapidminer.com/
  11. http://mallet.cs.umass.edu/
  12. https://www.clips.uantwerpen.be/pages/pattern
  13. https://nlp.stanford.edu/software/tokenizer.html#About
  14. https://tartarus.org/martin/PorterStemmer/
  15. http://www.nltk.org/api/nltk.stem.html
  16. https://snowballstem.org/
  17. https://pypi.python.org/pypi/PyStemmer/1.0.1
  18. https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html
  19. https://lucene.apache.org/core/
  20. https://dkpro.github.io/dkpro-core/
  21. http://ucrel.lancs.ac.uk/claws/
  22. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
  23. https://en.wikipedia.org/wiki/Shallow_parsing
  24. https://cogcomp.org/page/software_view/Chunker
  25. https://github.com/dstl/baleen
  26. https://github.com/CogComp/cogcomp-nlp/tree/master/ner
  27. https://github.com/lasigeBioTM/MER
  28. https://blog.paralleldots.com/product/dig-relevant-text-elements-entity-extraction-api/
  29. http://www.opencalais.com/about-open-calais/
  30. http://alias-i.com/lingpipe/index.html
  31. https://github.com/glample/tagger
  32. http://minorthird.sourceforge.net/old/doc/
  33. https://www.ibm.com/support/knowledgecenter/en/SS8NLW_10.0.0/com.ibm.watson.wex.aac.doc/aac-tasystemt.html
  34. https://www.poolparty.biz/
  35. https://www.basistech.com/text-analytics/rosette/entity-extractor/
  36. http://www.bart-coref.org/index.html
  37. https://wing.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
  38. http://cswww.essex.ac.uk/Research/nle/GuiTAR/
  39. https://www.cs.utah.edu/nlp/reconcile/
  40. https://github.com/brendano/arkref
  41. https://cogcomp.org/page/software_view/Coref
  42. https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30
  43. https://github.com/smartschat/cort
  44. http://www.hlt.utdallas.edu/~altaf/cherrypicker/
  45. http://nlp.lsi.upc.edu/freeling/
  46. https://corpling.uis.georgetown.edu/xrenner/#
  47. http://takelab.fer.hr/termex_s/
  48. https://www.athel.com/colloc.html
  49. http://linghub.lider-project.eu/metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75
  50. http://ws.racai.ro:9191/narratives/batch2/Colloc.pdf
  51. http://www.aclweb.org/anthology/E17-3027
  52. https://metacpan.org/pod/Text::NSP
  53. https://github.com/knowitall/reverb
  54. https://github.com/U-Alberta/exemplar
  55. https://github.com/aoldoni/tetre
  56. https://www.textrazor.com/technology
  57. https://github.com/machinalis/iepy
  58. https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#relations
  59. https://github.com/mit-nlp/MITIE

Bio: Data Monsters help corporations and funded startups research, design, and develop real-time intelligent software to improve their business with data technologies.

Original. Reposted with permission.
