This blog post is about text mining and visualizing records from London's Old Bailey court using Mathematica.
criminalintent.org, January 25th, 2011
The Proceedings of the Old Bailey contains about 120 million words. Trial records can be searched to locate individual instances of particular words (like 'brimstone'), but what do you do when you are curious about the kinds of patterns to be found across the archive as a whole? That's where data mining comes in.
While glancing through some records that came up while running one of our programs, I noticed a reference to a scarlet cloak. I got to wondering what kinds of colour words appear in the OB, and how frequent each one is. The two graphs below come from a Mathematica hack that took about two hours to do. First, I pulled out all of the words in the Old Bailey that also appear in an English dictionary. I also generated a list of dictionary words for spectral colours, then took the intersection of the two sets. Both of these operations were made easy by access to Wolfram Research's curated data sets from directly within Mathematica. I then used Wolfram Alpha to look up an RGB colour value for each colour word, again directly from Mathematica, and plotted each colour in an RGB space as a little cuboid. That's the first image. The second image shows, out of all the trials that include at least one colour word, what proportion mention each particular colour.
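For readers who want to see the shape of the computation, here is a minimal sketch of the same steps in Python. It is not the Mathematica code behind the graphs. The colour table (with approximate RGB values) stands in for the curated dictionary and Wolfram Alpha lookups, the sample trials are invented, and every name in it (`COLOUR_RGB`, `TRIALS`, `colour_words_in`, `colour_trial_proportions`) is hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for the dictionary and Wolfram Alpha lookups:
# a tiny hand-written table of colour words with approximate RGB values.
COLOUR_RGB = {
    "scarlet": (255, 36, 0),
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "blue": (0, 0, 255),
    "green": (0, 128, 0),
}

# Invented sample text, not real Old Bailey trials.
TRIALS = [
    "he stole a scarlet cloak and a black hat",
    "the prisoner took a white handkerchief",
    "a black gelding was taken from the stable",
    "she was indicted for stealing a silver spoon",
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def colour_words_in(trials, colour_terms):
    """Intersect the corpus vocabulary with the set of colour words."""
    vocabulary = set()
    for trial in trials:
        vocabulary.update(tokenize(trial))
    return vocabulary & set(colour_terms)


def colour_trial_proportions(trials, colour_terms):
    """For each colour, the share of colour-mentioning trials that mention it."""
    terms = set(colour_terms)
    counts = Counter()
    trials_with_colour = 0
    for trial in trials:
        found_here = set(tokenize(trial)) & terms
        if found_here:
            trials_with_colour += 1
            counts.update(found_here)  # count each colour once per trial
    if trials_with_colour == 0:
        return {}
    return {word: n / trials_with_colour for word, n in counts.items()}


found = colour_words_in(TRIALS, COLOUR_RGB)
points = {word: COLOUR_RGB[word] for word in found}  # coordinates for an RGB plot
proportions = colour_trial_proportions(TRIALS, COLOUR_RGB)
```

Here `points` holds the coordinates you would hand to a 3D plot, and `proportions` holds the bar heights for the second kind of graph. A colour is counted once per trial however often it appears there, so the proportions can add up to more than one whenever a trial mentions several colours.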