Text Mining on the Command Line

In this tutorial, I use raw bash commands and regex to process a raw and messy JSON file and a raw HTML page. The tutorial helps us understand the text-processing mechanics under the hood.



By Sabber Ahamed, Computational Geophysicist and Machine Learning Enthusiast


For the last couple of days, I have been thinking of writing something about my recent experience with using raw bash commands and regex to mine text. Of course, there are more sophisticated tools and libraries for processing text without writing so many lines of code. For example, Python has a built-in regex module, ‘re,’ that has many rich features for processing text. ‘BeautifulSoup,’ on the other hand, has nice built-in features for scraping and cleaning raw web pages. I also use these tools for faster processing of large text corpora and when I don’t feel like writing code.

I always prefer to use the command line. I feel at home on the command line when I work with text processing and file management. In this tutorial, I use raw bash commands and regex to process a raw and messy JSON file and a raw HTML page. The tutorial helps us understand the text-processing mechanics under the hood. I assume readers have basic familiarity with regex and bash commands.

In the first part of the tutorial, I show how bash commands like ‘grep,’ ‘sed,’ ‘tr,’ ‘column,’ ‘sort,’ ‘uniq,’ and ‘awk’ can be used with regex to process raw and messy text and get insight into the data. As an example, I use the Complete Works of Shakespeare provided by Project Gutenberg, which is in cooperation with World Library, Inc.

The complete works of Shakespeare can be downloaded from the internet. I downloaded the Complete Works of William Shakespeare and put it into a text file, “shakes.txt.” All right, let’s get started by looking at the file size:

ls -lah shakes.txt
### Display:
-rw-r--r--@ 1 sabber  staff   5.6M Jun 15 09:35 shakes.txt


‘ls’ is a bash command that shows the list of files or folders in a directory. The ‘-l’ flag displays the file type, owner, group, size, date, and filename. The ‘-a’ flag is used to display all files, including hidden ones. The ‘-h’ flag, one of my favorites, displays file sizes in a human-readable format. The size of shakes.txt is 5.6 megabytes.

Okay, now let’s read the file to see what’s in it. I use the ‘less’ and ‘tail’ commands to explore parts of the file. The names of the commands hint at their functionality. ‘less’ is used to view the contents of a text file one screen at a time. It is similar to ‘more’ but has the extended capability of allowing both forward and backward navigation through the file. The ‘-N’ flag can be used to display line numbers. Similarly, ‘tail’ shows the last couple of lines of the file.

less -N shakes.txt

### Display:
      1 
      2 Project Gutenberg’s The Complete Works of William Shakespeare, by William
      3 Shakespeare
      4
      5 This eBook is for the use of anyone anywhere in the United States and
      6 most other parts of the world at no cost and with almost no restrictions
      7 whatsoever.  You may copy it, give it away or re-use it under the terms

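The end of the file can be inspected in the same way with ‘tail’; a minimal sketch that prints the last 20 lines (output omitted here):

tail -n 20 shakes.txt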

It looks like the first couple of lines are not Shakespeare’s work but some information about Project Gutenberg. Similarly, if we look at the tail of the file, there are some lines unrelated to Shakespeare’s work. So I delete the unnecessary tail part first and then the header part of the file using the ‘sed’ command as below:

cat shakes.txt | sed -e '149260,149689d' | sed -e '1,141d' > shakes_new.txt


The above code snippet first deletes lines 149260 to 149689 at the tail and then deletes the first 141 lines of the header. The unwanted lines include some information about legal rights, Project Gutenberg, and the table of contents of the work.
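
In case you are wondering where those line numbers came from, one way to locate the boilerplate boundaries is to search for mentions of Project Gutenberg with ‘grep -n’, which prefixes each matching line with its line number (a rough sketch; the exact wording of the markers in your copy may differ):

grep -n "Project Gutenberg" shakes.txt | head
grep -n "Project Gutenberg" shakes.txt | tail

Alright, now let’s compute some statistics on the file using a pipe and ‘awk’.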

cat shakes_new.txt | wc | awk '{print "Lines: " $1 "\tWords: " $2 "\tCharacter: " $3 }'

### Display
Lines: 149118 Words: 956209 Character: 5827807


In the above code, I first stream the entire text of the file using ‘cat’ and then pipe it into ‘wc’ to count the number of lines, words, and characters. Finally, I use ‘awk’ to display the information. The counting and displaying can be done in tons of other ways; feel free to explore other options.
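
For example, ‘awk’ alone can produce the same numbers without ‘wc’; a minimal sketch (the character total counts the newline at the end of each line, so it may differ slightly from ‘wc’ depending on locale):

awk '{words += NF; chars += length($0) + 1} END {print "Lines: " NR "\tWords: " words "\tCharacters: " chars}' shakes_new.txt
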
All right, it’s time to clean the text and do some processing for further analysis. Cleaning includes converting the text to lowercase, removing all digits, removing all punctuation, and removing high-frequency words (stop words). Processing is not limited to these steps; it depends on the purpose. Since I am only here to show the necessary processing, I focus on the steps mentioned above. First, convert all the uppercase characters/words to lowercase, followed by removing all the digits and punctuation. To perform the processing, I use the famous bash command ‘tr,’ which translates or deletes characters in a text stream.

cat shakes_new.txt | tr 'A-Z' 'a-z' | tr -d '[:punct:]' | tr -d '[:digit:]' > shakes_new_cleaned.txt


The code snippet first converts the entire text to lowercase and then removes all the punctuation and digits.

### Display before:
      1 From fairest creatures we desire increase,
      2 That thereby beauty’s rose might never die,
      3 But as the riper should by time decease,
      4 His tender heir might bear his memory:
      5 But thou contracted to thine own bright eyes,
      6 Feed’st thy light’s flame with self-substantial fuel,
      7 Making a famine where abundance lies,
      8 Thy self thy foe, to thy sweet self too cruel:
      9 Thou that art now the world’s fresh ornament,
     10 And only herald to the gaudy spring,
     11 Within thine own bud buriest thy content,
     12 And, tender churl, mak’st waste in niggarding:
     13   Pity the world, or else this glutton be,
     14   To eat the world’s due, by the grave and thee.

### Display after:
      1 from fairest creatures we desire increase
      2 that thereby beautys rose might never die
      3 but as the riper should by time decease
      4 his tender heir might bear his memory
      5 but thou contracted to thine own bright eyes
      6 feedst thy lights flame with selfsubstantial fuel
      7 making a famine where abundance lies
      8 thy self thy foe to thy sweet self too cruel
      9 thou that art now the worlds fresh ornament
     10 and only herald to the gaudy spring
     11 within thine own bud buriest thy content
     12 and tender churl makst waste in niggarding
     13   pity the world or else this glutton be
     14   to eat the worlds due by the grave and thee


Tokenization is one of the basic preprocessing steps in natural language processing. Tokenization can be done at either the word or the sentence level. Here in this tutorial, I show how to tokenize the file into words. Tokenization on the command line can be performed with various commands like ‘sed,’ ‘awk,’ and ‘tr’; I find ‘tr’ the easiest. In the code below, I first extract the cleaned text. Then I use ‘tr’ with two flags, ‘-s’ and ‘-c,’ to put every word on its own line. Details of ‘tr’ and its various functionalities can be found in this StackExchange answer.

cat shakes_new_cleaned.txt | tr -sc 'a-z' '\12' > shakes_tokenized.txt

### Display (First 10 words)
      1 from
      2 fairest
      3 creatures
      4 we
      5 desire
      6 increase
      7 that
      8 thereby
      9 beautys
     10 rose

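As noted above, ‘tr’ is not the only option here; a minimal sketch of the same one-word-per-line tokenization with ‘awk’ (the output file name is only for illustration):

awk '{for (i = 1; i <= NF; i++) print $i}' shakes_new_cleaned.txt > shakes_tokenized_awk.txt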

Now that we have all the words tokenized, we can perform some analysis to answer questions like: what are the most and least frequent words in the entire body of Shakespeare’s work? In the code below, I first use the ‘sort’ command to sort all the words, then I use the ‘uniq’ command with the ‘-c’ flag to find the frequency of each word (‘uniq -c’ is similar to ‘groupby’ in Pandas or SQL). Finally, I sort the words by their frequency in either ascending (least frequent first) or descending (most frequent first) order.

cat shakes_tokenized.txt | sort | uniq -c | sort -nr > shakes_sorted_desc.txt

### Display

29768 the   28276 and  21868 i   20805 to  18650 of  15933 a      
14363 you   13191 my   11966 in  11760 that

cat shakes_tokenized.txt | sort | uniq -c | sort -n > shakes_sorted_asc.txt

### Display

1 aarons       1 abandoner    1 abatements     1 abatfowling          
1 abbominable  1 abaissiez    1 abashd         1 abates              
1 abbeys       1 abbots

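As a side note, the multi-column layout shown above is easy to reproduce by piping a few lines through ‘column,’ one of the commands mentioned at the beginning:

head -n 10 shakes_sorted_desc.txt | column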

The above results reveal some interesting observations. For example, the ten most frequent words are pronouns, prepositions, or conjunctions. If we want to extract more abstract information from the document, we need to remove all the stop words: the prepositions, pronouns, conjunctions, and modal verbs. It also depends on the purpose of the analysis; one might be interested only in prepositions, in which case it’s fine to keep them. On the other hand, the least frequent words are words like ‘abandoner,’ ‘abatements,’ and ‘abashd.’ A linguistics or literature student may draw better intuitions from these simple analytics from their own perspective.

In the next step, I show how to use ‘awk’ to remove all the stop words on the command line. In this tutorial, I use NLTK’s list of English stopwords, to which I have added a couple more words. Details of the following code can be found in this StackOverflow answer. Details of the different options and variables of awk can also be found in the awk manual (‘man awk’ on the command line).
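
If you do not have the NLTK list at hand, a tiny hand-rolled stop_words.txt is enough to follow along (illustration only; the real list is much longer):

printf 'the\nand\ni\nto\nof\na\nyou\nmy\nin\nthat\n' > stop_words.txt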

awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop_words.txt shakes_tokenized.txt > shakes_stopwords_removed.txt


Alright, after removing the stop words, let’s sort the words in ascending and descending order as above.

cat shakes_stopwords_removed.txt | sort | uniq -c | sort -nr > shakes_sorted_desc.txt

### Display most frequent

3159 lord   2959 good  2924 king  2900 sir
2634 come   2612 well  2479 would 2266 love
2231 let    2188 enter

cat shakes_stopwords_removed.txt | sort | uniq -c | sort -n > shakes_sorted_asc.txt

### Display least frequent
1 aarons       1 abandoner    1 abatements     1 abatfowling          
1 abbominable  1 abaissiez    1 abashd         1 abates              
1 abbeys       1 abbots


After removing the stop words, we see that the most frequent word Shakespeare uses in this corpus is ‘lord,’ followed by ‘good.’ The word ‘love’ is also among the top most frequent words. The least frequent words remain the same.

Now that we are done with the necessary processing and cleaning, in the next tutorial I will discuss how we can perform some analytics. Until then, if you have any questions, feel free to ask. Please leave a comment if you see any typos or mistakes, or if you have better suggestions. You can reach out to me:

Email: sabbers@gmail.com
LinkedIn: https://www.linkedin.com/in/sabber-ahamed/
Github: https://github.com/msahamed
Medium: https://medium.com/@sabber/

 
Bio: Sabber Ahamed is the Founder of xoolooloo.com. Computational Geophysicist and Machine Learning Enthusiast.

Original. Reposted with permission.
