The Mueller Report Word Cloud: A brief tutorial in R
Word clouds are simple visual summaries of the most frequently used words in a text. They present essentially the same information as a histogram, but are somewhat less precise and vastly more eye-catching. Get a quick sense of the themes in the recently released Mueller Report and its 448 pages of legal content.
By Rick Klein, Université Grenoble Alpes
This is a quick and dirty tutorial for generating a word cloud from the Mueller report using R. Code is available on GitHub (https://github.com/raklein/mueller-wordcloud), and all credit goes to Alboukadel Kassambara’s word cloud tutorial (http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know) because I simply implemented an abbreviated version and added a few lines of code.
Word clouds are simple visual summaries of the most frequently used words in a text, after removing common and uninformative words. They present essentially the same information as a histogram, but are somewhat less precise and vastly more eye-catching. They give you a quick sense of the themes in a text, which is useful when you’re dealing with 448 pages of legal content.
First, load our required libraries. We’ll be using:
library("pdftools") # to convert pdf to text library("tm") # tools to work with text library("wordcloud") # generate the wordcloud library("RColorBrewer") # color palette library("Cairo") # antialiasing for better graphics
Now download the report and place it in your working directory (e.g., https://cdn.cnn.com/cnn/2019/images/04/18/mueller-report-searchable.pdf).
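If you’d rather stay in R, something like the following should fetch it. This is a minimal sketch: the URL is the one above, and mode = "wb" keeps the binary file intact on Windows.
download.file("https://cdn.cnn.com/cnn/2019/images/04/18/mueller-report-searchable.pdf",
              destfile = "mueller-report-searchable.pdf",
              mode = "wb") # write in binary mode so the pdf isn't corrupted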
Convert the pdf to text and store it as a character vector, one element per page. The conversion will make some mistakes, but it worked well enough for this document.
tex <- pdf_text("mueller-report-searchable.pdf")
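As a quick sanity check (my addition, not part of the original pipeline), you can confirm that pdf_text() returned one string per page and eyeball a page for garbled text:
length(tex) # should be 448, one element per page
cat(substr(tex[1], 1, 300)) # peek at the start of page 1 to spot conversion errors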
Convert that character vector to a corpus so the tm package can work with it.
docs <- Corpus(VectorSource(tex))
Here we implement several steps to "tidy up" the corpus and remove common words that wouldn’t be very informative in a word cloud for our purposes. These are straight from Kassambara’s tutorial:
Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
Remove numbers
docs <- tm_map(docs, removeNumbers)
Remove common English words
docs <- tm_map(docs, removeWords, stopwords("english"))
Specify any additional words you want removed
Revisit this line after doing the visualization and add any extras (the frequency check after these cleanup steps can help you choose)
docs <- tm_map(docs, removeWords, c("president", "presidents", "also"))
Remove punctuation
docs <- tm_map(docs, removePunctuation)
Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
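To decide which extra words belong in that removeWords call, it helps to look at the most frequent terms left in the cleaned corpus. Here’s a minimal sketch using tm’s TermDocumentMatrix (the objects tdm and freqs are my names, not part of the original code):
tdm <- TermDocumentMatrix(docs) # term-by-page frequency matrix
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freqs, 20) # any uninformative words here can go in the removeWords call above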
The layout has a randomization component, so let's fix the random seed to make the exact same figure reproducible.
set.seed(1)
Before making the figure, we're going to specify a graphics device to output to
I'm using Cairo because it adds antialiasing for higher quality
CairoPNG("wordcloud.png", width = 450, height = 450)
Make wordcloud
wordcloud(words = docs,
          scale = c(5, 0.5),      # size difference between largest and smallest words
          min.freq = 1,
          max.words = 150,        # how many words to plot
          random.order = FALSE,
          rot.per = 0.35,         # what % of words will be rotated
          colors = brewer.pal(8, "Dark2")) # specify the color palette
Turn off the Cairo graphics device, which effectively saves the wordcloud as a .png
dev.off()
Done! You should have a beautiful, semi-informative figure. Note that this figure was originally made in about 20 minutes, and I’ve received lots of suggestions for ways to make it better. Experiment!
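For instance, since a word cloud carries roughly the same information as a frequency plot, one easy experiment is to chart the top terms directly. This quick sketch reuses the freqs vector from the frequency check above; the styling choices are mine:
barplot(head(freqs, 15),
        las = 2, # rotate axis labels so the words are readable
        col = brewer.pal(8, "Dark2"),
        main = "Most frequent words in the Mueller Report")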
Bio: Rick Klein is a post-doctoral researcher at the Université Grenoble Alpes. He is interested in the reproducibility of psychological science across contexts.
Related:
- Machine Learning Finds “Fake News” with 88% Accuracy
- Generating Text with RNNs in 4 Lines of Code
- Find Out What Celebrities Tweet About the Most