Mining Twitter Data with Python Part 3: Term Frequencies
Part 3 of this 7-part series on mining Twitter data discusses the analysis of term frequencies for meaningful term extraction.
By Marco Bonzanini, Independent Data Science Consultant.
This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.
Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis that we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the data set. In this example, I’ll use the set of my tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with me for a couple of paragraphs).
We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. In order to keep track of the frequencies while we are processing the tweets, we can use collections.Counter(), which internally is a dictionary (term: count) with some useful methods like most_common():
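A minimal sketch of this counting step follows. The preprocess() function from Part 2 is not reproduced here, so a simplified stand-in tokeniser (simple_preprocess, a hypothetical name) is used, and the tweets are a small in-memory list rather than the data collected in Part 1. Both are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Simplified stand-in for the preprocess() function of Part 2 (assumption):
# keeps @-mentions, #hashtags and URLs as single tokens, then plain words,
# then any other non-space character (e.g. punctuation).
TOKEN_RE = re.compile(r"(?:@\w+)|(?:#\w+)|(?:https?://\S+)|(?:\w+)|(?:\S)")

def simple_preprocess(text):
    """Split a tweet into lowercased tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]

# Hypothetical sample data standing in for the collected tweets.
tweets = [
    {"text": "RT @marcobonzanini: just an example! #NLP"},
    {"text": "An example of #NLP with #Python: http://example.com"},
]

count_all = Counter()
for tweet in tweets:
    # List of all the terms in this tweet
    terms_all = simple_preprocess(tweet["text"])
    # Update the running term frequencies
    count_all.update(terms_all)

# Print the 5 most frequent tokens as (term, count) pairs
print(count_all.most_common(5))
```

Even on this toy input, punctuation such as the colon is counted alongside the words, which is the kind of noise discussed next.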
The above code will produce some unimpressive results:
As you can see, the most frequent words (or should I say, tokens) are not exactly meaningful.
In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. In the example above, we can see three common stop-words: to, and and on. Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list of English stop-words).
Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.
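One possible way to assemble such a stop list is sketched below. It tries the English stop-word list shipped with NLTK (which requires the stopwords corpus to be downloaded) and, as an assumption for environments where NLTK or its data is not available, falls back to a tiny hand-written list that is only illustrative.

```python
import string

try:
    from nltk.corpus import stopwords
    # Raises LookupError if the 'stopwords' corpus has not been downloaded
    english_stop = stopwords.words('english')
except (ImportError, LookupError):
    # Fallback (assumption): a very small illustrative list, not a real resource
    english_stop = ['a', 'an', 'and', 'the', 'to', 'on', 'of', 'in', 'is']

# Every single punctuation character is treated as a stop token
punctuation = list(string.punctuation)

# Twitter-specific tokens that a default stop-word list does not contain
stop = set(english_stop) | set(punctuation) | {'rt', 'via'}
```

A set is used here so that membership checks stay fast when filtering a large number of tokens.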
We can now substitute the variable terms_all in the first example with something like:
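A sketch of that substitution, assuming stop is a collection of stop tokens. The small stop set and the pre-tokenised tweets below are hypothetical values used only to keep the example self-contained.

```python
from collections import Counter

# Hypothetical stop list: a few stop-words, punctuation and Twitter-specific tokens
stop = {'to', 'and', 'on', 'the', ':', '!', 'rt', 'via'}

# Hypothetical tokenised tweets, as a tokeniser like preprocess() might return them
tokenised_tweets = [
    ['rt', '@user', ':', 'introduction', 'to', 'python', 'and', 'nltk'],
    ['notes', 'on', 'python', 'via', '@user', '!'],
]

count_stop = Counter()
for tokens in tokenised_tweets:
    # Keep only the terms that are not in the stop list
    terms_stop = [term for term in tokens if term not in stop]
    count_stop.update(terms_stop)

print(count_stop.most_common(5))
```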
After counting, sorting the terms and printing the top 5, this is the result:
More term filters
Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here you have some examples that you can embed in the first fragment of code:
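The filters below are a sketch of what such customisations could look like: counting each term once per tweet, keeping hashtags only, and keeping plain terms only (no hashtags, no mentions). The token list and the stop set are hypothetical sample values.

```python
# Hypothetical tokens from a single tweet, and a hypothetical stop list
terms_all = ['rt', '@user', '#python', 'python', 'data', '#nlp', 'data', ':']
stop = {'rt', 'via', ':'}

# Count terms only once per tweet (equivalent to document frequency)
terms_single = set(terms_all)

# Keep hashtags only
terms_hash = [term for term in terms_all if term.startswith('#')]

# Keep terms only: no stop tokens, no hashtags, no mentions.
# str.startswith() accepts a tuple of prefixes to test against.
terms_only = [term for term in terms_all
              if term not in stop and not term.startswith(('#', '@'))]

print(terms_hash)
print(terms_only)
```

Any of these lists can be passed to Counter.update() in place of the full list of terms.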
After counting and sorting, these are my most commonly used hashtags:
and these are my most commonly used terms:
While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).
The bigrams() function from NLTK will take a list of tokens and produce a list of tuples using adjacent tokens. Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. In case we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, just in case we want to capture phrases like “to be or not to be”.
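A short sketch of the bigram step on an already filtered list of terms. The sample terms are hypothetical, and the pure-Python fallback is an assumption added only so the example also runs where NLTK is not installed; it pairs adjacent tokens in the same way.

```python
from collections import Counter

try:
    from nltk.util import bigrams
except ImportError:
    # Fallback (assumption): pair each token with the one that follows it
    def bigrams(tokens):
        tokens = list(tokens)
        return zip(tokens, tokens[1:])

# Hypothetical terms after stop-word removal
terms_stop = ['nice', 'article', 'extractive', 'summarisation']

# bigrams() yields tuples of adjacent tokens; list() materialises them
terms_bigram = list(bigrams(terms_stop))

count_bigram = Counter()
count_bigram.update(terms_bigram)

print(terms_bigram)
```

Because tuples are hashable, the bigrams can be counted with the same Counter approach used for single terms.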
So after counting and sorting the bigrams, this is the result:
So apparently I tweet about nice articles (I wouldn't bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.
This article has built on top of the previous ones to discuss the basics of extracting interesting terms from a data set of tweets, by using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful for getting a bird’s eye view of the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.
Bio: Marco Bonzanini is a Data Scientist based in London, UK. Active in the PyData community, he enjoys working in text analytics and data mining applications. He's the author of "Mastering Social Media Mining with Python" (Packt Publishing, July 2016).
Original. Reposted with permission.
- Mining Twitter Data with Python Part 1: Collecting Data
- Mining Twitter Data with Python Part 2: Text Pre-processing
- Tutorial: Building a Twitter Sentiment Analysis Process