Data Mining and Text Analytics of World Cup 2014

Explore how text analysis techniques can be used to dig into World Cup 2014 data in this summary of a series of blog posts, focusing on matches and their events, Tweet languages, Tweet volumes for different teams and sentiment analysis.

By Parsa Ghaffari (AYLIEN), Jan 2015.

The following post is a summarized version of a two-part series on our Text Analysis Blog.

The FIFA World Cup is without doubt the biggest sporting event in the world. During the 2014 World Cup, millions of fans and viewers from all over the globe used social media to share their thoughts and emotions about the games, teams and players, creating massive amounts of data in the process.

Throughout the tournament, Facebook saw a record-breaking 3 billion interactions and Twitter saw a whopping 672 million Tweets about the World Cup. At AYLIEN we decided to collect, analyze and visualize some of this data to look for interesting insights and correlations.

Data and Tools

  • tweets.csv: ~30 million Tweets collected between June 6th and July 14th.
  • matches.csv: Information about the 64 matches, such as match times and results, from the World Cup json project.
  • events.csv: Information about match events, such as goals, substitutions and cards, from the World Cup json project.

Tools: AYLIEN Text Analysis API for Sentiment Analysis, RapidMiner for data processing and Tableau for interactive visualizations.

We started off by taking a look at the matches and their events, such as goals, substitutions and red and yellow cards, and graphed them to see how things developed as the tournament progressed.

Fig 1. World Cup 2014, Goals, Substitutions, Red and Yellow Cards

An interesting observation here is that the number of matches with 4 or more yellow cards increased in the later stages of the tournament, while the number of red cards decreased dramatically as players began to fear missing later games through suspension.
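One way to check an observation like this is to average the cards per match for each stage of the tournament. The sketch below is only an illustration: the post does not list the columns of events.csv, so the `(match_id, stage, event_type)` layout, the stage names and the event labels are all assumptions, and the rows are invented.

```python
from collections import defaultdict

# Hypothetical rows: (match_id, stage, event_type). The real events.csv
# layout is not given in the post, so these fields are assumptions and
# the values below are invented for illustration.
EVENTS = [
    (1, "group", "yellow-card"),
    (1, "group", "red-card"),
    (1, "group", "goal"),
    (2, "group", "yellow-card"),
    (3, "knockout", "yellow-card"),
    (3, "knockout", "yellow-card"),
    (3, "knockout", "yellow-card"),
    (3, "knockout", "yellow-card"),
]


def cards_per_match(events):
    """Return {stage: {"yellow": avg per match, "red": avg per match}}."""
    matches = defaultdict(set)  # stage -> ids of matches seen in that stage
    counts = defaultdict(lambda: {"yellow": 0, "red": 0})
    for match_id, stage, event_type in events:
        matches[stage].add(match_id)
        if event_type == "yellow-card":
            counts[stage]["yellow"] += 1
        elif event_type == "red-card":
            counts[stage]["red"] += 1
    return {
        stage: {color: n / len(ids) for color, n in counts[stage].items()}
        for stage, ids in matches.items()
    }


for stage, averages in cards_per_match(EVENTS).items():
    print(stage, averages)
```

Dividing by the number of matches matters here because the group stage has far more matches than the knockout rounds, so raw totals would not be comparable.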

Tweets by fans
We wanted to understand which nations Tweeted the most in support of their teams and which languages most Tweets were written in. Looking at four of the teams that made it relatively far in the competition, fans of the Netherlands were certainly the most vocal. You can also see how Germany gathered more vocal followers as they progressed through the tournament.

Fig 2. World Cup 2014, Tweets by Fans

When it came to languages, Tweets in the English language accounted for over 50% of all Tweets and Spanish came in as the second most popular language, which isn’t too surprising.

Fig 3. World Cup 2014, Tweets by Language
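A language breakdown like this comes down to counting the language code attached to each Tweet and converting the counts to percentages. The following is a minimal sketch, assuming each Tweet carries a language code such as "en" or "es"; the sample list is invented and does not reflect the real distribution.

```python
from collections import Counter


def language_shares(lang_codes):
    """Return [(lang, share in percent)] sorted from most to least common."""
    counts = Counter(lang_codes)
    total = sum(counts.values())
    return [(lang, 100.0 * n / total) for lang, n in counts.most_common()]


# Invented sample: one language code per Tweet.
sample = ["en", "en", "es", "en", "pt", "es", "en", "fr"]
for lang, share in language_shares(sample):
    print(f"{lang}: {share:.1f}%")
```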

Tweets over Time
Plotting the total volume of Tweets over time showed a repeating pattern of spikes appearing at match times and also at times when a major event had occurred (such as the elimination of a team, qualification for the next round, or shocking results).
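To make the idea of a spike concrete, the sketch below buckets Tweet timestamps by hour and flags the hours whose volume is well above the typical hourly level. This is not necessarily how the original charts were produced: the threshold (three times the median) is an arbitrary assumption, and the timestamps are invented.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def hourly_volume(timestamps):
    """Count Tweets per hour; keys are datetimes truncated to the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


def spikes(volume, factor=3.0):
    """Return the hours whose count exceeds `factor` times the median count."""
    if not volume:
        return []
    baseline = median(volume.values())
    return sorted(hour for hour, n in volume.items() if n > factor * baseline)


# Invented sample: three quiet hours followed by one busy hour.
sample = []
for hour, count in [(17, 2), (18, 2), (19, 2), (20, 10)]:
    sample += [datetime(2014, 7, 8, hour, minute) for minute in range(count)]

print(spikes(hourly_volume(sample)))
```

The median is used as the baseline rather than the mean so that the spikes themselves do not inflate the level they are compared against.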

We looked at how the volume of Tweets in a language was affected by matches and critical events for countries where that language is spoken. We graphed a number of these in our original post, but let's look at Tweets written in Portuguese as an example.

Fig 4. World Cup 2014, Tweets over time

The volume of Tweets in Portuguese increased dramatically following Brazil's 1-1 draw against Chile and reached its highest point after Brazil's shocking loss to Germany, in which they conceded 7 goals.

Sentiment Analysis
We decided to dive a little deeper into the Tweets collected to better understand fans’ opinions. We looked at the polarity values (“positive” or “negative”) of the Tweets to see how they fared for different entities and how they changed over time as a result of various events.

Note: we only analyzed English Tweets for the following examples, which introduces a sampling bias.

By tracking the sentiment of Tweets containing a country's official hashtag, we determined that the most popular teams were the USA, Germany and Brazil, with Greece being the least popular. When it came to players, sentiment towards Neymar, Messi and Howard was highest, and I bet you can guess who had the most negative Tweets about them... Yep, Luis Suarez. You can read more about that here.
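As a rough illustration of how a popularity ranking can be derived from polarity labels, the sketch below computes a net sentiment score per hashtag: positive minus negative Tweets, divided by their sum. It assumes each Tweet has already been labelled "positive" or "negative" by a sentiment classifier; the scoring formula, the placeholder hashtags and the sample labels are assumptions made for this example, not the method or data behind the ranking above.

```python
from collections import defaultdict


def net_sentiment(labelled):
    """Map each hashtag to (positive - negative) / (positive + negative).

    Scores range from -1.0 (all negative) to 1.0 (all positive).
    Labels other than "positive" and "negative" are ignored.
    """
    tallies = defaultdict(lambda: [0, 0])  # hashtag -> [positive, negative]
    for hashtag, polarity in labelled:
        if polarity == "positive":
            tallies[hashtag][0] += 1
        elif polarity == "negative":
            tallies[hashtag][1] += 1
    return {
        tag: (pos - neg) / (pos + neg) for tag, (pos, neg) in tallies.items()
    }


# Invented (hashtag, polarity) pairs with placeholder hashtags.
sample = (
    [("#TEAM_A", "positive")] * 3
    + [("#TEAM_A", "negative")]
    + [("#TEAM_B", "positive")]
    + [("#TEAM_B", "negative")] * 3
)

# Rank hashtags from most to least positively discussed.
ranking = sorted(net_sentiment(sample).items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```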

Luis Suarez

Uruguayan striker Luis Suarez was accused of biting Italian defender Giorgio Chiellini. The incident, one of the most shocking events of the entire World Cup, was followed by a wave of negative comments and feedback on social media. Suarez issued an apology on June 30th, which seems to have satisfied the Twitter community (take note, PR people!):

Fig 5. World Cup 2014, Suarez biting scandal

Sentiment over time

Naturally enough, different events concerning players or teams affect how people think and talk about them. Using polarity analysis, we can get an idea of people’s reaction to various events, which provides valuable insight into the opinions of fans.

Below we graphed the sentiment of Tweets related to Brazil over the course of the tournament. It’s interesting to note how certain events caused a shift in sentiment and how, overall, more and more negativity seeped into the voice of the Brazilian fans.

Fig 6. World Cup 2014, Sentiment over time
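A sentiment-over-time curve of this kind can be sketched as the daily share of positive Tweets, optionally smoothed with a trailing moving average so that a longer trend is easier to see. The code below is an illustration under assumptions: the input is a list of `(day, polarity)` pairs from some sentiment classifier, the window size is arbitrary, and the sample values are invented.

```python
from collections import defaultdict
from datetime import date


def daily_positive_share(labelled):
    """Return [(day, share of positive Tweets)] in date order."""
    tallies = defaultdict(lambda: [0, 0])  # day -> [positive, total]
    for day, polarity in labelled:
        tallies[day][1] += 1
        if polarity == "positive":
            tallies[day][0] += 1
    return [(day, pos / total) for day, (pos, total) in sorted(tallies.items())]


def moving_average(values, window=3):
    """Trailing mean over up to `window` values, to smooth day-to-day noise."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Invented sample: sentiment turning more negative over three days.
sample = (
    [(date(2014, 7, 1), "positive")] * 3 + [(date(2014, 7, 1), "negative")]
    + [(date(2014, 7, 2), "positive")] * 2 + [(date(2014, 7, 2), "negative")] * 2
    + [(date(2014, 7, 3), "positive")] + [(date(2014, 7, 3), "negative")] * 3
)

series = daily_positive_share(sample)
trend = moving_average([share for _, share in series], window=2)
print(series)
print(trend)
```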

You can read the entire series, which has more graphs and visualizations, here.

About AYLIEN: We are a Text Analysis company that has built a Text Analysis API, among other products, designed to help developers, data scientists, business people and academics extract meaning from text. You can try out our API by signing up for an account or visiting our sandbox.