
A Beginner’s Guide to Tweet Analytics with Pandas


 
 




Twitter provides access to analytics for all of its users, but I am assuming relatively few vanilla tweeples pay much attention to its existence. There are a variety of other services which can help perform tweet and audience analytics, as well as further analysis such as geographic breakdowns and natural language processing, but when paired with some simple Python, the Twitter-supplied data can be incredibly useful.

This is a simple guide to getting your hands a bit dirty doing analysis on your own in Python. Unlike a lot of other tutorials which often pull from the real-time Twitter API, we will be using the downloadable Twitter Analytics data, and most of what we do will be done in Pandas.

Before we get started, let's get the obligatory imports out of the way.
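A minimal set of imports that would cover the steps in this guide, assuming pandas is installed. Using `re` and `Counter` for the hashtag and mention counts later on is an assumption of this sketch, not something the text prescribes.

```python
import re                        # pulling hashtags and mentions out of tweet text
from collections import Counter  # tallying the most frequent tags and handles

import pandas as pd              # loading and analysing the exported CSV
```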

Get and Inspect the Data

 
First we need the data. This part is easy enough; go to Twitter, click on the upper right menu (your profile pic), select Analytics, choose the Tweets tab along the top, use the date range pickers to select a time period, and choose Export Data. It doesn't matter how much data you use; our simple example will work with any amount. I chose the default, Past 28 Days.


[Figure: Get that Twitter analytics data.]

Once we have the CSV file, we will want to load it into a Pandas DataFrame for analysis.
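A rough sketch of the loading step follows. The column names, the time format, and the made-up rows are assumptions standing in for a real export, and the in-memory `sample_csv` is a hypothetical replacement for the downloaded file so that the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for the exported file (made-up rows); with a real export you would
# pass the path of the downloaded CSV to pd.read_csv instead.
sample_csv = io.StringIO(
    "Tweet id,Tweet text,time,impressions,retweets,likes,"
    "promoted impressions,promoted engagements\n"
    '1,"First #python tweet",2017-03-01 11:05 +0000,500,4,7,-,-\n'
    '2,"Hello @kdnuggets",2017-03-03 18:20 +0000,120,0,1,-,-\n'
)

tweets_df = pd.read_csv(sample_csv)

# Omit every column whose name starts with "promoted" (assumed naming).
promoted_cols = [c for c in tweets_df.columns if c.startswith("promoted")]
tweets_df = tweets_df.drop(columns=promoted_cols)

print(tweets_df.columns.tolist())
```

Filtering on a name prefix avoids listing each promoted column by hand, at the cost of relying on the export keeping that naming.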

Don't mind all the dropped columns; while a lot of what is there is useful for our analysis -- tweet text, time, impressions, retweets, etc. -- many are not -- all the promoted things -- and so we will just omit them from the start.

As we would with any data analysis project, next we have a look at the data.

A quick look at the DataFrame, with head() for the first few rows and a row count, tells us that I have tweeted a measly 95 times in the past 4 weeks. Not a very large dataset, and we probably would not want to make any inferences based on our findings, but a good enough toy set to start out with.
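An inspection along these lines can be sketched as below. The small frame is made-up data standing in for the loaded export, and the column names are assumptions.

```python
import pandas as pd

# Tiny made-up frame standing in for the loaded export.
tweets_df = pd.DataFrame({
    "Tweet text": ["tweet a", "tweet b", "tweet c"],
    "impressions": [500, 120, 900],
    "retweets": [4, 0, 9],
})

# First few rows, to see which columns and value types we have to work with.
print(tweets_df.head())

# One row per tweet, so the row count is the number of tweets in the period.
n_tweets = len(tweets_df)
print("Total tweets this period:", n_tweets)
```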

Let's see what useful analytics we can pull out of this.

Basic Tweet Stats

 
So, given what data is shown in the output of running head() on the dataset above, and having a rough intuition of what tweet metrics would be useful, we will grab the following stats:

  • Retweets - Mean RTs per tweet & top 5 RTed tweets
  • Likes - Mean likes per tweet & top 5 liked tweets
  • Impressions - Mean impressions per tweet & top 5 tweets with most impressions
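The three stats above all follow the same pattern, so one way to compute them is a small helper applied to each metric. This is a sketch under assumptions: `summarize` is a hypothetical helper, the column names are assumed, and the rows are made-up.

```python
import pandas as pd

# Made-up rows standing in for the loaded export.
tweets_df = pd.DataFrame({
    "Tweet text": ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e", "tweet f"],
    "retweets": [4, 0, 9, 1, 2, 3],
    "likes": [7, 1, 12, 0, 3, 2],
    "impressions": [500, 120, 900, 80, 300, 150],
})


def summarize(df, metric, n=5):
    """Return the mean of `metric` and the n rows with the largest values."""
    mean_value = df[metric].mean()
    top = df.nlargest(n, metric)[["Tweet text", metric]]
    return mean_value, top


print("Total tweets this period:", len(tweets_df), "\n")
for metric in ("retweets", "likes", "impressions"):
    mean_value, top = summarize(tweets_df, metric)
    print(f"Mean {metric}: {mean_value:.2f}\n")
    print(f"Top {len(top)} tweets by {metric}:")
    for text, value in zip(top["Tweet text"], top[metric]):
        print(f"{text} - {value}")
    print()
```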

Total tweets this period: 95 

Mean retweets: 1.72 

Top 5 RTed tweets:
------------------
A PyTorch IPython Notebook tutorial on #deeplearning, with an emphasis on #NaturalLanguageProcessing https://t.co/bxiBD42T7E https://t.co/awTsZA8R9v  -  26
On the Origin of #DeepLearning https://t.co/oe7r43HHVS #NeuralNetworks #arxiv https://t.co/BIcba61FR9  -  25
7 MORE Steps to Mastering #MachineLearning With #Python https://t.co/5yAjeUpCfS https://t.co/juWH0rQaNR  -  16
Every Intro to #DataScience Course on the Internet, Ranked https://t.co/rQG7Higk6b https://t.co/b6VveKfJxD  -  8
Pandas & Seaborn - A guide to handle & visualize #data elegantly @tryolabs https://t.co/LPq2q8k1i1 #Python #dataviz https://t.co/k2IoWsttXM  -  7

Mean likes: 3.17 

Top 5 liked tweets:
-------------------
A PyTorch IPython Notebook tutorial on #deeplearning, with an emphasis on #NaturalLanguageProcessing https://t.co/bxiBD42T7E https://t.co/awTsZA8R9v  -  52
7 MORE Steps to Mastering #MachineLearning With #Python https://t.co/5yAjeUpCfS https://t.co/juWH0rQaNR  -  37
On the Origin of #DeepLearning https://t.co/oe7r43HHVS #NeuralNetworks #arxiv https://t.co/BIcba61FR9  -  37
Pandas & Seaborn - A guide to handle & visualize #data elegantly @tryolabs https://t.co/LPq2q8k1i1 #Python #dataviz https://t.co/k2IoWsttXM  -  16
I've been reposted on @YhatHQ - The Current State of Automated #MachineLearning https://t.co/ggAW1Hrmxk https://t.co/8030HhAMMA  -  14

Mean impressions: 674.39

Top 5 tweets with most impressions:
-----------------------------------
On the Origin of #DeepLearning https://t.co/oe7r43HHVS #NeuralNetworks #arxiv https://t.co/BIcba61FR9 - 6409
A PyTorch IPython Notebook tutorial on #deeplearning, with an emphasis on #NaturalLanguageProcessing https://t.co/bxiBD42T7E https://t.co/awTsZA8R9v - 5684
7 MORE Steps to Mastering #MachineLearning With #Python https://t.co/5yAjeUpCfS https://t.co/juWH0rQaNR - 4374
I've been reposted on @YhatHQ - The Current State of Automated #MachineLearning https://t.co/ggAW1Hrmxk https://t.co/8030HhAMMA - 2270
Pandas & Seaborn - A guide to handle & visualize #data elegantly @tryolabs https://t.co/LPq2q8k1i1 #Python #dataviz https://t.co/k2IoWsttXM - 1818


I won't bother with any analysis of these metrics. Needless to say, I should step my social media game up.

Top #Hashtags and @Mentions

 
It's no secret that hashtags play an important role in Twitter, and mentions can also help grow your network and influence. Together they help put the 'social' in social networking, transforming platforms like Twitter from passive experiences to very active ones. With that, getting a handle on the most social aspect of this social network can be a helpful endeavour.
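One possible way to count hashtags and mentions is sketched below. It uses regular expressions, which is a choice of this sketch and not necessarily how the counts reported here were produced. `top_tokens` is a hypothetical helper and the tweets are made-up.

```python
import re
from collections import Counter

# Made-up tweet texts; with the DataFrame you would pass the tweet text column.
tweets = [
    "Learning #Python and #MachineLearning with @kdnuggets",
    "More #python tips from @KDnuggets and @some_user",
    "#DeepLearning reading list",
]

# \w matches letters, digits and underscores, so handles keep their underscores.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def top_tokens(texts, pattern, n=10):
    """Count case-insensitive matches of `pattern` and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(token.lower() for token in pattern.findall(text))
    return counts.most_common(n)


print("Top 10 hashtags:")
for tag, count in top_tokens(tweets, HASHTAG_RE):
    print(f"{tag} - {count}")

print("\nTop 10 mentions:")
for handle, count in top_tokens(tweets, MENTION_RE):
    print(f"{handle} - {count}")
```

Lowercasing before counting treats #Python and #python as the same tag, which suits Twitter's case-insensitive hashtags and handles.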

Top 10 hashtags:
----------------
machinelearning - 21
deeplearning - 16
python - 15
datascience - 10
neuralnetworks - 10
ai - 5
data - 4
datascientist - 3
tensorflow - 3
rstats - 3

Top 10 mentions:
----------------
kdnuggets - 3
francescoai - 2
jakevdp - 2
quora - 2
yhathq - 2
noahmp - 1
udacity - 1
clavitolo - 1
monkeylearn - 1
nicholashould - 1


Putting aside some evident bumps like punctuation being removed from tweet text prior to checking Twitter handles (this could be a problem if you have tweeps named both Francesco_AI and FrancescoAI), this works and is at least relatively Pythonic (though I'm sure it could be more so).

Time-series Analysis

 
Finally, let's have a look at some very basic temporal data. We will check mean impressions for tweets based -- independently -- on both the hour of day and day of week that they are tweeted. I caution (once again) that this is based on very little data, and so nothing useful will likely be gleaned. However, entire social media campaigns are planned around this kind of analysis when much larger amounts of tweet data are available.

While this is based on impressions, it could just as reasonably (and easily) be changed to use engagements, or RTs, or whatever else you please. Working in advertising, and promoting tweets? Maybe you are more interested in some of those promotion* metrics we hacked off the dataset at the start.

We have to convert the Twitter supplied date field to a legitimate Python datetime object, bin the data based on which hourly slot it falls into, identify days of week, and then capture this data in a couple of additional columns in the DataFrame, which we will pillage for stats afterward.
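The steps just described might look something like the following sketch. The `time` column name and its format string are assumptions about the export, the rows are made-up, and `METRIC` is a hypothetical name that makes it easy to swap impressions for another column.

```python
import pandas as pd

# Made-up rows; the "time" format is an assumption about the export.
tweets_df = pd.DataFrame({
    "time": [
        "2017-03-01 11:05 +0000",
        "2017-03-01 11:40 +0000",
        "2017-03-03 18:20 +0000",
        "2017-03-06 09:15 +0000",
    ],
    "impressions": [1000, 2000, 600, 400],
})

METRIC = "impressions"  # swap for another numeric column if preferred
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Convert the text timestamps to real datetimes, then derive the two bins.
tweets_df["time"] = pd.to_datetime(
    tweets_df["time"], format="%Y-%m-%d %H:%M %z", utc=True
)
tweets_df["hour"] = tweets_df["time"].dt.hour      # hourly slot, 0 to 23
tweets_df["day"] = tweets_df["time"].dt.dayofweek  # 0 = Monday

# Mean of the metric and number of tweets per bin.
by_hour = tweets_df.groupby("hour")[METRIC].agg(["mean", "count"])
by_day = tweets_df.groupby("day")[METRIC].agg(["mean", "count"])

print(f"Average {METRIC} per tweet by hour tweeted:")
for hour, row in by_hour.iterrows():
    print(f"{hour} - {hour + 1} : {row['mean']:.0f} => {int(row['count'])} tweets")

print(f"\nAverage {METRIC} per tweet by day of week tweeted:")
for day, row in by_day.iterrows():
    print(f"{DAY_NAMES[day]} : {row['mean']:.0f} => {int(row['count'])} tweets")
```

The hours in this sketch are in UTC; converting the column with `dt.tz_convert` before extracting the hour would shift them to a local time zone.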

Average impressions per tweet by hour tweeted:
----------------------------------------------
0 - 1 : 141 => 1 tweets
9 - 10 : 445 => 10 tweets
10 - 11 : 611 => 9 tweets
11 - 12 : 1319 => 10 tweets
12 - 13 : 528 => 10 tweets
13 - 14 : 448 => 11 tweets
14 - 15 : 464 => 16 tweets
15 - 16 : 763 => 8 tweets
17 - 18 : 634 => 9 tweets
18 - 19 : 1306 => 8 tweets
19 - 20 : 454 => 1 tweets
21 - 22 : 186 => 1 tweets
23 - 24 : 208 => 1 tweets

Average impressions per tweet by day of week tweeted:
-----------------------------------------------------
Mon : 475 => 20  tweets
Tue : 568 => 18  tweets
Wed : 1418 => 18  tweets
Thu : 545 => 17  tweets
Fri : 432 => 22  tweets


It seems I tweet at rather consistent times of day. It also seems that my Wednesday tweets, 11 AM tweets, and 6 PM tweets are my bread and butter. Of course, this is based on 95 tweets, and so is meaningless and inconclusive. However, after performing these same steps on some considerably larger sets of data, some interesting trends have been observed which may help lead to business decisions. All from some simple Python.

While not earth-shattering, our simple Pandas-based Twitter analytics code is enough to get us thinking about how we may better use social media. Applied to the right data, elementary scripts can be quite powerful.
