Elementary, My Dear Watson! An Introduction to Text Analytics via Sherlock Holmes
Want to learn about the field of text mining? Go on an adventure with Sherlock & Watson. Here you will find the different sub-domains of text mining, along with a practical example.
By Michael Fire, U. of Washington.
As a data scientist, analyzing text corpora is one of the tasks I enjoy most. By analyzing various text sources, we can learn a lot about the world around us. In this post you’ll find some of my favorite resources for learning about text analysis, as well as my own example tutorial using the Sherlock Holmes stories.
Here are some of my favorite blog posts that do this:
- Using sentiment analysis, we can learn about people’s mood swings throughout the week by analyzing their tweets on Twitter.
- Using the Bag-of-words and TF-IDF models, we can classify documents.
- Using topic model algorithms, we can “uncover the hidden thematic structure in document collections.”
- Using Named Entity Recognition (NER), we can learn about connections among entities.
- Using Word2Vec, we can infer the gender of a blog author.
Recently, someone asked me, “How can I start learning about NLP?” I recommended that he start reading about the subjects mentioned above and try to solve several of Kaggle’s competitions, such as Bag of Words Meets Bags of Popcorn and the StumbleUpon Evergreen Classification Challenge. Additionally, I decided to write the following IPython notebook, which will hopefully help him and other developers like him enter the world of NLP.
In this notebook, “Text Analytics Tutorial using Sherlock Holmes Stories,” I present a practical way to learn how to analyze large text collections. We start by downloading Sir Arthur Conan Doyle’s collection of Sherlock Holmes stories. We then use Python regular expressions and the NLTK package to perform a very simple analysis of the Sherlock stories, such as counting the number of sentences and counting the number of times a specific word appears across all the stories. We then move on to some NER. I demonstrate how it is possible to reconstruct Sherlock’s social network, using the characters’ names from Wikipedia or by using the Stanford Named Entity Recognizer software. Next, we move to topic models, using GraphLab Create’s Topic Model Toolkit and pyLDAvis, where I demonstrate how to analyze paragraphs in the Sherlock Holmes stories. Lastly, I show how Word2Vec can be used to find similarly styled paragraphs.
Topic Model for Sherlock Holmes stories.
My main goal in writing this notebook is to give some practical (and hopefully interesting) examples showing how easy and straightforward it is to perform NLP with today’s tools. I really hope that after reading this tutorial, you will try some NLP yourself and discover intriguing insights in different datasets.
Bio: Michael Fire is a Washington Research Foundation Innovation Postdoctoral Fellow in Data Science and a U. of Washington Moore/Sloan Data Science Postdoctoral Fellow. He received his Ph.D. in Information Systems Engineering from Ben-Gurion University, where he won the Kreitman Prize for excellence.