Data Mining for Predictive Social Network Analysis – Brazil Elections Case Study

Here are the techniques used for a proof-of-concept that effectively analyzed Twitter Trend Topics to predict regional voting patterns in the 2014 Brazilian presidential election.

Network Topology

Network topology is the arrangement of the elements (nodes, links, etc.) of a network. For the social network we are analyzing, the topology does not change dramatically across the 3 days, since the nodes of the network (i.e., the 14 cities) remain fixed. What does change is the weight of each link between nodes, since the number of trend topics that two cities have in common varies from day to day, as shown in the comparison below of the network topology on Day 24 vs. Day 25.
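To make the idea of link weights concrete, here is a minimal sketch that builds the weighted links for one day from each city's set of trend topics. The function name, the topic strings, and the city names other than Fortaleza are placeholders invented for illustration; they are not the data or code used in the original analysis.

```python
from itertools import combinations


def build_weighted_links(trends_by_city):
    """Return {(city_a, city_b): number of shared trend topics}.

    Pairs of cities with no topic in common get no link at all.
    """
    links = {}
    for a, b in combinations(sorted(trends_by_city), 2):
        shared = set(trends_by_city[a]) & set(trends_by_city[b])
        if shared:
            links[(a, b)] = len(shared)
    return links


# Hypothetical trend topics for three cities on two days (made-up values).
day_24 = {
    "Fortaleza": {"#topicA", "#topicB", "#topicC"},
    "City B": {"#topicA", "#topicB"},
    "City C": {"#topicC", "#topicD"},
}
day_25 = {
    "Fortaleza": {"#topicA", "#topicD"},
    "City B": {"#topicA", "#topicE"},
    "City C": {"#topicD", "#topicE"},
}

links_24 = build_weighted_links(day_24)
links_25 = build_weighted_links(day_25)

# The nodes are identical on both days; only the links and weights differ.
for pair in sorted(set(links_24) | set(links_25)):
    print(pair, links_24.get(pair, 0), "->", links_25.get(pair, 0))
```

In this toy data, the Fortaleza/City B link drops from a weight of 2 to 1 between the two days, while a new link appears between City B and City C.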

Predicting Election Results Using Twitter Trend Topic Data

To assist us in predicting election results, we consider not only the trend topics in common between cities, but also how the content of those topics relates to likely support for each of the two principal political parties; i.e., Partido dos Trabalhadores (PT) and Partido da Social Democracia Brasileira (PSDB).

First, I created a list of words and phrases perceived to indicate a positive leaning toward, or support for, one of the parties. (Populating this list is admittedly a highly complex task. In the context of this proof of concept, I deliberately took a simplified approach. If anything, this makes the caliber of the results all the more intriguing, since a more highly tuned list of terms and phrases would presumably further improve the accuracy of the results.)

Then, for each node, I counted:

  • the number of its links that include terms indicating support for PT
  • the number of its links that include terms indicating support for PSDB

Using the city of Fortaleza again as an example, I ended up with counts of:

Fortaleza['PT'] = 56
Fortaleza['PSDB'] = 37

Since the PT count is higher than the PSDB count, we conclude that Fortaleza residents have an overall preference for Partido dos Trabalhadores (PT).
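The counting step can be sketched roughly as follows. This is an illustration under stated assumptions, not the original implementation: the term lists, topic strings, and city names other than Fortaleza are hypothetical placeholders, and a link is assumed to "include" a term when any of its shared topics contains that term as a substring.

```python
def count_party_links(city, link_topics, party_terms):
    """Count, per party, the links of `city` whose shared topics contain a support term."""
    counts = {party: 0 for party in party_terms}
    for (a, b), topics in link_topics.items():
        if city not in (a, b):
            continue  # this link does not touch the city
        text = " ".join(topics).lower()
        for party, terms in party_terms.items():
            if any(term.lower() in text for term in terms):
                counts[party] += 1
    return counts


def preferred_party(counts):
    """Return the party with the most supporting links, or None on a tie."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical support terms (placeholders, not the list used in the study).
party_terms = {"PT": ["#pt_term"], "PSDB": ["#psdb_term"]}

# Hypothetical links: each maps a pair of cities to their shared trend topics.
link_topics = {
    ("City B", "Fortaleza"): ["#pt_term_1", "#other"],
    ("City C", "Fortaleza"): ["#psdb_term_1"],
    ("City D", "Fortaleza"): ["#pt_term_2"],
    ("City B", "City C"): ["#pt_term_1"],
}

toy_counts = count_party_links("Fortaleza", link_topics, party_terms)
print(toy_counts, preferred_party(toy_counts))

# The same decision rule applied to the counts reported above for Fortaleza.
print(preferred_party({"PT": 56, "PSDB": 37}))
```

Plain substring matching is deliberately crude and can produce false positives for short terms; a more carefully tuned term list or tokenized matching would be a natural refinement.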

Results and Conclusions

Based on this algorithm, the analysis yields results that are surprisingly similar to the actual election results, especially when one considers the general simplicity of our approach. Here’s a comparison of the predictions based on the Twitter Trend Topic data with the real election results (red represents Partido dos Trabalhadores and blue represents Partido da Social Democracia Brasileira):

[Figure: predicted results based on Twitter Trend Topic data vs. actual election results]

Improved scientific rigor, as well as more sophisticated algorithms and metrics, would likely improve the results even further.

Here are a few metrics, for example, that could be used to infer a node’s importance or influence, which could in turn inform the type of predictive analysis described in this article:

  • Node centrality. Numerous node centrality measures exist that can help identify the most important or influential nodes in a network. Betweenness centrality, for example, considers a node highly important if it forms bridges between many other nodes. Eigenvector centrality, on the other hand, bases a node’s importance on the number of other highly important nodes that link to it.
  • Clustering coefficient. The clustering coefficient of a node measures the extent to which a node’s “neighbors” are connected to one another. This is another measure that can be relevant to evaluating a node’s presumed degree of influence on its neighboring nodes.
  • Degree centrality. Degree centrality is based on the number of links (i.e., connections) to a node. This is one of the simplest measures of a node’s “significance” within a network.
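As a rough illustration of three of these metrics, the sketch below computes degree centrality, the local clustering coefficient, and eigenvector centrality (via simple power iteration) on a tiny unweighted graph. The four-node graph and the function names are invented for this example and are not part of the original analysis; betweenness centrality is omitted to keep the sketch short. In practice a graph library would typically be used instead.

```python
def degree_centrality(adj):
    """Fraction of the other nodes that each node is linked to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves linked."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2 * linked / (k * (k - 1))


def eigenvector_centrality(adj, iterations=100):
    """Power iteration: a node scores high when its neighbors score high.

    Each step adds the node's own previous score (i.e., iterates on A + I),
    which avoids oscillation on bipartite graphs.
    """
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[u] for u in adj[v]) for v in adj}
        norm = sum(x * x for x in new.values()) ** 0.5
        score = {v: x / norm for v, x in new.items()}
    return score


# Hypothetical undirected graph: a triangle A-B-C with D attached to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

dc = degree_centrality(adj)
cc = {v: clustering_coefficient(adj, v) for v in adj}
ev = eigenvector_centrality(adj)
print(dc)
print(cc)
print(max(ev, key=ev.get))  # the node with the highest eigenvector score
```

In this toy graph, node C has the highest degree and eigenvector centrality because it touches every other node, yet its clustering coefficient is low because D is not connected to A or B.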

But even without that level of sophistication, the results achieved with this simple proof-of-concept provided a compelling demonstration of effective predictive analysis using Twitter Trend Topic data. There is clearly the potential to take social media data analysis even further in the future.

Bio: Elder Santos is an accomplished software engineer, specializing in machine learning and data science. He has expertise in the full life cycle of the software design process.