8 Things to Check When You Analyze Twitter Data

A review of biases and issues in large-scale studies of human behavior on social media, as discussed in a recent paper published in Science.

In 1948, the Chicago Tribune relied on telephone surveys to predict the United States presidential election. However, these surveys undersampled supporters of Truman, who turned out to be the winner. The newspaper ran the “Dewey Defeats Truman” headline the day after Truman won, which became one of the most famous erroneous headlines in newspaper history.

With the explosion of social media data, it may seem that research based on such a tremendous amount of data about public opinion would be free of bias. However, a study by two computer scientists, Derek Ruths from McGill and Jurgen Pfeffer from CMU, published in Science, warns that large-scale studies of human behavior in social media may be misleading and need to be held to higher methodological standards.

Check if your research has one of the issues listed below.

Social media data may not be an accurate representation of human populations.
  • Population bias. Substantial population biases exist and vary across social media platforms. Researchers also often do not know when and how social media providers change the sampling and filtering of their data streams.
  • Human behavior and online platform design. To increase platform use and adoption, the designers of social media platforms incorporate known human behavior patterns, such as the tendency for a friend of a friend to become a friend, into their link-suggestion algorithms. However, few studies disentangle organic human behavior from such platform-driven behavior.
  • Distortion of human behavior. Important information can be lost or obscured by design decisions. For example, Google stores and reports the final search submitted, after auto-completion is applied; the text actually typed by the user may be a more accurate resource for analyzing human behavior.
  • Nonhumans in large-scale studies. On all major online social platforms, there are large populations of spammers and bots masquerading as normal humans. In addition, some social media accounts are created specifically to strategically influence other users.

Your methods may be wrong.
  • Proxy population mismatch. The quantitative relation between the proxy population and the population actually studied is often unknown, which can introduce serious bias.
  • Incomparability of methods and data. Because of platforms’ sensitivity to user privacy and the competitive value of their data, it is difficult, if not impossible, for other researchers to evaluate results and compare them to existing methods on the same data set.
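To make the nonhuman-accounts issue concrete, here is a minimal, illustrative sketch (not from the paper) of rule-based bot filtering in Python. The field names and thresholds are assumptions for demonstration only; real studies typically use supervised classifiers trained on labeled accounts.

```python
# Illustrative heuristics for flagging likely nonhuman accounts.
# Field names and thresholds are hypothetical, for demonstration only.

def is_likely_bot(account):
    """Flag accounts with bot-like statistics (hypothetical schema)."""
    followers = account.get("followers", 0)
    following = account.get("following", 0)
    tweets_per_day = account.get("tweets_per_day", 0.0)

    # Extremely high posting rates are rare for human users.
    if tweets_per_day > 100:
        return True
    # Mass-following with almost no followers suggests spam/automation.
    if following > 1000 and followers < 10:
        return True
    return False

accounts = [
    {"followers": 250, "following": 300, "tweets_per_day": 4.0},
    {"followers": 2, "following": 5000, "tweets_per_day": 80.0},
]
humans = [a for a in accounts if not is_likely_bot(a)]
print(len(humans))  # 1
```

Even a crude filter like this changes population estimates, which is why the checklist below asks researchers to report how nonhuman accounts were handled.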


Based on the above analysis, Derek Ruths and Jurgen Pfeffer provide a checklist of recommended practices when using social media data. A sound study:
  1. Quantifies platform-specific biases (platform design, user base, platform-specific behavior, platform storage policies)
  2. Quantifies biases of available data (access constraints, platform-side filtering)
  3. Quantifies proxy population biases/mismatches
  4. Applies filters/corrects for nonhuman accounts in data
  5. Accounts for platform and proxy population biases (Corrects for platform-specific and proxy population biases / Tests robustness of findings)
  6. Accounts for platform-specific algorithms (Shows results for more than one platform / Shows results for time-separated data sets from the same platform)
  7. For new methods: compares results to existing methods on the same data
  8. For new social phenomena, methods, or classifiers: reports performance on two or more data sets
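One standard way to act on items 3 and 5 (correcting for proxy-population mismatch) is post-stratification weighting: reweight each respondent so the sample’s demographic mix matches the target population. A minimal sketch, with hypothetical age-group shares and opinions:

```python
# Post-stratification sketch: reweight sampled opinions so age-group
# proportions match the target population. All numbers are hypothetical.

# Share of each age group in the target population vs. in the sample.
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}
sample_share = {"18-29": 0.50, "30-49": 0.35, "50+": 0.15}

# Weight for each group = population share / sample share.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Each respondent: (age group, supports the proposal?)
respondents = [("18-29", True), ("18-29", True), ("30-49", False),
               ("30-49", True), ("50+", False)]

raw_support = sum(s for _, s in respondents) / len(respondents)
weighted_support = (sum(weights[g] * s for g, s in respondents)
                    / sum(weights[g] for g, _ in respondents))

print(round(raw_support, 2))       # naive estimate, skewed toward young users
print(round(weighted_support, 2))  # estimate corrected for the skewed sample
```

In this toy example the naive estimate overstates support because young users are oversampled; the weighted estimate downweights them. The correction is only as good as the demographic information available, which is exactly the limitation the checklist asks researchers to quantify.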