Mining Twitter Data with Python Part 5: Data Visualisation Basics
Part 5 of this series takes on data visualization, as we look to make sense of our data and highlight interesting insights.
By Marco Bonzanini, Independent Data Science Consultant.
A picture is worth a thousand tweets: more often than not, designing a good visual representation of our data can help us make sense of it and highlight interesting insights. After collecting and analysing Twitter data, the tutorial continues with some notions on data visualisation with Python.
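Firstly, let’s install Vincent, a Python library that translates data into Vega plot descriptions for D3.js (for example with pip install vincent).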
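A minimal sketch of the step that produces term_freq.json, assuming the term counts from the previous parts are held in a collections.Counter. The name count_terms_only, the sample counts and both helper functions are illustrative; Vincent is imported only inside save_bar_chart, so the data-shaping helper can be tried on its own.

```python
from collections import Counter


def bar_chart_data(term_counts, n=20):
    """Shape the n most common terms into the dict layout passed to vincent.Bar."""
    labels, freq = zip(*term_counts.most_common(n))
    return {"data": freq, "x": labels}


def save_bar_chart(term_counts, path="term_freq.json", n=20):
    """Write a Vega description of the bar chart to `path` (requires Vincent)."""
    import vincent  # third-party: pip install vincent

    bar = vincent.Bar(bar_chart_data(term_counts, n), iter_idx="x")
    bar.to_json(path)
    return bar


# Hypothetical counts standing in for the real term frequencies.
count_terms_only = Counter({"#rbs6nations": 12, "#itavwal": 7, "#engvfra": 5})
data = bar_chart_data(count_terms_only, n=2)
```

Calling save_bar_chart(count_terms_only) would then write term_freq.json to the working directory.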
At this point, the file term_freq.json will contain a description of the plot that can be handed over to D3.js and Vega. A simple template (taken from Vincent resources) to visualise the plot:
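A sketch of such a template, written out from Python so that the file names stay in one place. The page layout, the script URLs and the vg.parse.spec call are assumptions based on the Vega 1.x scaffold distributed with Vincent; they should be checked against the Vincent resources, and the URLs may have moved. The names CHART_TEMPLATE and write_chart_page are illustrative.

```python
from pathlib import Path

# Assumed layout of the Vega scaffold; the script URLs are not from the text
# and may no longer resolve.
CHART_TEMPLATE = """<html>
<head>
    <title>Vega Scaffold</title>
    <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
    <script src="http://d3js.org/topojson.v1.min.js"></script>
    <script src="http://d3js.org/d3.geo.projection.v0.min.js" charset="utf-8"></script>
    <script src="http://trifacta.github.com/vega/vega.js"></script>
</head>
<body>
    <div id="vis"></div>
</body>
<script type="text/javascript">
// Parse the Vega spec and render it into the #vis element.
function parse(spec) {
  vg.parse.spec(spec, function(chart) { chart({el: "#vis"}).update(); });
}
parse("%(spec)s");
</script>
</html>
"""


def write_chart_page(spec_file="term_freq.json", html_file="chart.html"):
    """Write an HTML page that loads `spec_file` with Vega."""
    page = CHART_TEMPLATE % {"spec": spec_file}
    Path(html_file).write_text(page, encoding="utf-8")
    return html_file
```

Calling write_chart_page() with its defaults writes chart.html, which expects term_freq.json in the same directory.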
Save the above HTML page as chart.html and run the simple Python web server:
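With Python 3 the usual one-liner for this is python -m http.server 8888, run from the directory that holds chart.html. As a rough equivalent in code, the standard library's http.server module can be used directly; make_server and serve are illustrative names, and the sketch assumes the files live in the current working directory.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


def make_server(port=8888):
    """Create a server that serves files from the current working directory."""
    return HTTPServer(("localhost", port), SimpleHTTPRequestHandler)


def serve(port=8888):
    """Serve the current directory until interrupted with Ctrl+C."""
    server = make_server(port)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

Calling serve() blocks until it is interrupted, so it is best run in its own terminal.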
Now you can open your browser at http://localhost:8888/chart.html and observe the result:
Note: you could also save the HTML template directly from Python when exporting the chart but, at least in Python 3, the output is not well-formed HTML and you’d need to manually strip some characters.
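As a sketch of how that export might be wrapped, assuming Vincent's to_json() accepts html_out and html_path keyword arguments (worth confirming against the installed version). The function name export_with_html is illustrative.

```python
def export_with_html(chart, json_path="term_freq.json", html_path="chart.html"):
    """Save the Vega spec and ask Vincent to write its HTML scaffold as well.

    `chart` is expected to be a Vincent chart such as vincent.Bar; the
    html_out/html_path keywords are assumed from Vincent's to_json().
    """
    chart.to_json(json_path, html_out=True, html_path=html_path)
    return json_path, html_path
```

The resulting HTML file may still need the manual clean-up mentioned above.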
With this procedure, we can plot many different types of charts with Vincent. Let’s take a moment to browse the docs and see its capabilities.
Time Series Visualisation
Another interesting aspect of analysing data from Twitter is the possibility to observe the distribution of tweets over time. In other words, if we organise the frequencies into temporal buckets, we could observe how Twitter users react to real-time events.
One of my favourite tools for data analysis with Python is Pandas, which also has fairly decent support for time series. As an example, let’s track the hashtag #ITAvWAL to observe what happened during the first match.
Firstly, if we haven’t done it yet, we need to install Pandas, for example with pip install pandas.
In the main loop which reads all the tweets, we simply track the occurrences of the hashtag, i.e. we can refactor the code from the previous episodes into something similar to:
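A minimal sketch of such a loop, assuming one JSON-encoded tweet per line with text and created_at fields. A simple regular expression stands in for the preprocess() tokeniser from the earlier parts, the resampling is written in the newer resample(...).sum() style, and the function names are illustrative. Pandas is imported inside per_minute so that the parsing helper has no third-party dependencies.

```python
import json
import re

HASHTAG_RE = re.compile(r"#\w+")
# created_at layout used by the Twitter API, e.g. "Wed Aug 27 13:08:45 +0000 2008"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def hashtag_dates(lines, hashtag="#itavwal"):
    """Return the created_at value of every tweet mentioning `hashtag`."""
    dates = []
    for line in lines:
        tweet = json.loads(line)
        # Stand-in for preprocess(): lowercase the text and pull out hashtags.
        terms_hash = HASHTAG_RE.findall(tweet["text"].lower())
        if hashtag in terms_hash:
            dates.append(tweet["created_at"])
    return dates


def per_minute(dates):
    """Turn a list of created_at strings into a per-minute count series."""
    import pandas as pd  # third-party: pip install pandas

    idx = pd.to_datetime(dates, format=TWITTER_DATE_FORMAT)
    # A series of 1s, one per tweet, indexed by time.
    series = pd.Series([1] * len(dates), index=idx)
    # Bucket into 1-minute intervals and sum the 1s in each bucket.
    return series.resample("1min").sum().fillna(0)
```

With f as the open file of collected tweets, per_minute(hashtag_dates(f)) gives the per-minute series.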
The last line is what allows us to track the frequencies over time. The series is re-sampled at 1-minute intervals, so all the tweets falling within a particular minute are aggregated; more precisely, they are summed up, given how='sum' (in recent pandas releases the how argument has been removed, and the equivalent is to call .sum() on the result of resample()). The time index no longer keeps track of the seconds. If there is no tweet in a particular minute, the fillna() function fills the blanks with zeros.
To put the time series in a plot with Vincent:
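A minimal sketch of this step, assuming the input is the resampled pandas.Series and that Vincent's Line chart, axis_titles() and to_json() behave as in its documentation. The function name and the line_factory parameter are hypothetical; the latter is only a hook that lets the function be tried without Vincent installed.

```python
def save_time_chart(series, path="time_chart.json", line_factory=None):
    """Plot a time series as a Vincent line chart and save its Vega spec."""
    if line_factory is None:
        # Imported lazily so the function can be exercised without Vincent.
        import vincent  # third-party: pip install vincent

        line_factory = vincent.Line
    time_chart = line_factory(series)
    time_chart.axis_titles(x="Time", y="Freq")
    time_chart.to_json(path)
    return time_chart
```

The axis titles "Time" and "Freq" are placeholders and can be changed freely.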
Once you embed the time_chart.json file into the HTML template discussed above, you’ll see this output:
The interesting moments of the match are observable from the spikes in the series. The first spike just before 1pm corresponds to the first Italian try. All the other spikes between 1:30 and 2:30pm correspond to Welsh tries and show the Welsh dominance during the second half. The match was over by 2:30, so after that Twitter went quiet.
Rather than just observing one sequence at a time, we could compare different series to observe how the matches have evolved. So let’s refactor the code for the time series, keeping track of the three different hashtags #ITAvWAL, #SCOvIRE and #ENGvFRA in the corresponding pandas.Series.
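A sketch of that refactoring, assuming one JSON-encoded tweet per line and using a regular expression instead of preprocess(); all function names are illustrative. It collects the timestamps for each hashtag in a single pass, then combines the per-minute series into one pandas.DataFrame for Vincent. Pandas and Vincent are imported inside plot_matches, so the collection step can be tried without them.

```python
import json
import re

HASHTAG_RE = re.compile(r"#\w+")
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"
MATCHES = ("#itavwal", "#scovire", "#engvfra")


def dates_by_hashtag(lines, hashtags=MATCHES):
    """Map each tracked hashtag to the created_at values of its tweets."""
    dates = {tag: [] for tag in hashtags}
    for line in lines:
        tweet = json.loads(line)
        found = set(HASHTAG_RE.findall(tweet["text"].lower()))
        for tag in found.intersection(dates):
            dates[tag].append(tweet["created_at"])
    return dates


def plot_matches(dates, path="time_chart.json"):
    """Resample every hashtag per minute and save a multi-line chart."""
    import pandas as pd  # third-party
    import vincent  # third-party

    per_minute = {}
    for tag, stamps in dates.items():
        idx = pd.to_datetime(stamps, format=TWITTER_DATE_FORMAT)
        per_minute[tag.lstrip("#")] = pd.Series(1, index=idx).resample("1min").sum()
    # Minutes covered by one match but not another become NaN, hence fillna(0).
    all_matches = pd.DataFrame(per_minute).fillna(0)

    time_chart = vincent.Line(all_matches)
    time_chart.axis_titles(x="Time", y="Freq")
    time_chart.legend(title="Matches")
    time_chart.to_json(path)
    return all_matches
```

Building the DataFrame from a dict of series aligns them on a shared time index, which is what lets the three matches share one x-axis.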
And the output:
We can immediately observe when the different matches took place (approx 12:30-2:30, 2:30-4:30 and 5-7) and we can see how the last match had all the attention, especially at the end when the winner was revealed.
Bio: Marco Bonzanini is a Data Scientist based in London, UK. Active in the PyData community, he enjoys working in text analytics and data mining applications. He's the author of "Mastering Social Media Mining with Python" (Packt Publishing, July 2016).
Original. Reposted with permission.