
The Mueller Report Word Cloud: A brief tutorial in R


Word clouds are simple visual summaries of the most frequently used words in a text, presenting essentially the same information as a histogram in a form that is somewhat less precise and vastly more eye-catching. Get a quick sense of the themes in the recently released Mueller Report and its 448 pages of legal content.



By Rick Klein, Université Grenoble Alpes

This is a quick and dirty tutorial for generating a word cloud from the Mueller report using R. Code is available on GitHub (https://github.com/raklein/mueller-wordcloud), and all credit goes to Alboukadel Kassambara’s word cloud tutorial (http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know) because I simply implemented an abbreviated version and added a few lines of code.

Word clouds are simple visual summaries of the most frequently used words in a text, after removing common and uninformative words. They present essentially the same information as a histogram but are somewhat less precise and vastly more eye-catching. They give you a quick sense of the themes in a text, which is useful when you’re dealing with 448 pages of legal content.

First, load our required libraries. We’ll be using:

library("pdftools") # to convert pdf to text
library("tm") # tools to work with text
library("wordcloud") # generate the wordcloud
library("RColorBrewer") # color palette 
library("Cairo") # antialiasing for better graphics


Now download the report and place it in your working directory (e.g., from https://cdn.cnn.com/cnn/2019/images/04/18/mueller-report-searchable.pdf).

Convert the pdf to text, and store it as a character vector. Mistakes will be made here, but it worked OK for this document.

tex <- pdf_text("mueller-report-searchable.pdf")
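
pdf_text() returns a character vector with one element per page, so a quick sanity check is to count the pages and look for any that came back empty (which typically happens when a page has no text layer). The sketch below uses a hypothetical helper, summarise_pages(), which is not part of pdftools, and a toy vector standing in for the real output:

```r
# Hypothetical helper (not part of pdftools): summarise a character vector
# that holds one string per page, which is the shape pdf_text() returns.
summarise_pages <- function(pages) {
  stopifnot(is.character(pages))
  list(
    n_pages     = length(pages),               # number of pages extracted
    empty_pages = which(trimws(pages) == ""),  # pages with no extracted text
    total_chars = sum(nchar(pages))            # rough size of the extraction
  )
}

# Toy stand-in for the real output (an assumption for illustration only).
toy_pages <- c("Report On The Investigation", "", "Volume I of II")
page_summary <- summarise_pages(toy_pages)

print(page_summary$n_pages)
print(page_summary$empty_pages)
```

With the report itself you would call summarise_pages(tex) and expect n_pages to match the page count of the PDF.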


Convert that character vector (one element per page) to a corpus so the tm package can work with it. Each page becomes one document in the corpus.

docs <- Corpus(VectorSource(tex))


Here we implement several steps to "tidy up" the corpus and remove common words that wouldn’t be very informative in a word cloud for our purposes. These are straight from Kassambara’s tutorial:

Convert the text to lower case

docs <- tm_map(docs, content_transformer(tolower))


Remove numbers

docs <- tm_map(docs, removeNumbers)


Remove common English words

docs <- tm_map(docs, removeWords, stopwords("english"))


Specify any additional words you want removed
Revisit this line after doing the visualization and add any extras

docs <- tm_map(docs, removeWords, c("president", "presidents", "also"))


Remove punctuation

docs <- tm_map(docs, removePunctuation)


Eliminate extra white spaces

docs <- tm_map(docs, stripWhitespace)
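
To see what these steps do, it can help to run them on a single toy sentence. The tm transformations used above also accept plain character vectors, so the sketch below applies them directly, in the same order as the tutorial. The sentence is made up for illustration. Note that stop words are removed before punctuation: the English stop word list contains contractions such as "didn't", which would no longer match once the apostrophe is gone.

```r
library("tm")

# Made-up sentence for illustration; not taken from the report.
toy <- "The President's 2 advisers didn't meet in 2016."

step1 <- tolower(toy)                              # lower case
step2 <- removeNumbers(step1)                      # drop digits
step3 <- removeWords(step2, stopwords("english"))  # drop common English words
step4 <- removePunctuation(step3)                  # drop punctuation
cleaned <- stripWhitespace(step4)                  # collapse repeated spaces

print(cleaned)
```

Content words such as "advisers" and "meet" survive, while "the", "in", "didn't" and the digits are gone.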


There's a randomization component (mostly in terms of layout), so let's lock that randomization so we can reproduce the exact same figure if we want to

set.seed(1)
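
As a small illustration of what locking the randomization means, the sketch below draws the same random sample twice after setting the same seed. It uses sample() as a stand-in for the random choices wordcloud makes internally, which is an assumption for illustration only.

```r
# Same seed, same draws: this is what makes the layout reproducible.
set.seed(1)
first_draw <- sample(1:100, 5)

set.seed(1)
second_draw <- sample(1:100, 5)

print(identical(first_draw, second_draw))  # TRUE

# Reset the seed so the steps that follow start from the same state
# as in the tutorial.
set.seed(1)
```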


Before making the figure, we're going to specify a graphics device to output to
I'm using Cairo because it adds antialiasing for higher quality

CairoPNG("wordcloud.png", width = 450, height = 450)


Make wordcloud

wordcloud(words = docs,
          scale = c(5, 0.5), # size range: largest and smallest words
          min.freq = 1, # words below this frequency are not plotted
          max.words = 150, # how many words to plot
          random.order = FALSE, # plot words in decreasing frequency
          rot.per = 0.35, # proportion of words rotated 90 degrees
          colors = brewer.pal(8, "Dark2")) # specify the color palette


Turn off the Cairo graphics device, which effectively saves the wordcloud as a .png

dev.off()
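
To decide which extra words to add to the removal step, or to see the histogram-style numbers behind the figure, you can build a frequency table from the corpus. This is a minimal sketch using tm's TermDocumentMatrix() on a toy corpus (the two toy documents are made up for illustration); with the report you would pass docs instead of toy_docs.

```r
library("tm")

# Toy corpus standing in for the cleaned report (illustration only).
toy_docs <- Corpus(VectorSource(c(
  "russia campaign russia",
  "campaign investigation russia"
)))

# Rows are terms, columns are documents; row sums give total counts.
tdm <- TermDocumentMatrix(toy_docs)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

freq_table <- data.frame(word = names(freq),
                         freq = unname(freq),
                         stringsAsFactors = FALSE)
print(head(freq_table, 10))
```

Words at the top of this table that carry little meaning are candidates for the additional-words line above. A call such as barplot(freq[1:20], las = 2) would draw the corresponding bar chart; run it after dev.off() so it does not end up in the .png.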


Done! You should have a beautiful, semi-informative figure. Note that this figure was originally made in about 20 minutes, and I’ve received lots of suggestions for ways to make it better. Experiment!

 
Bio: Rick Klein is a post-doctoral researcher at the Université Grenoble Alpes. He is interested in the reproducibility of psychological science across contexts.
