Bot or Not: an end-to-end data analysis in Python

Twitter bots are programs that compose and post tweets without human intervention, and they range widely in complexity. Here we are building a classifier with pandas, NLTK, and scikit-learn to identify Twitter bots.



These bots are weird

Fast-forward to clean, well-formatted data and it doesn’t take long to find fishiness. On average, bots follow 1,400 people whereas humans follow 500. Bots are similarly strange in their distribution of followers. Humans show a wide spread of follower counts: some people are popular, some not so much, and many fall somewhere in between. These bots, by contrast, are extremely unpopular, with an average of a measly 28 followers.
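
Profile-level comparisons like these are one-liners in pandas. Here is a minimal sketch, assuming the account data lives in a users DataFrame with one row per account, the Twitter profile fields friends_count and followers_count, and a 0/1 bot label (the file and column names are purely illustrative):

import pandas as pd

# Load the labeled account-level data; file and column names are illustrative
users = pd.read_csv('users.csv')

# Average friends and followers for humans (bot == 0) vs. bots (bot == 1)
print(users.groupby('bot')[['friends_count', 'followers_count']].mean())

# A fuller look at how follower counts are distributed within each group
print(users.groupby('bot')['followers_count'].describe())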

Tweets into data

Sure, these bots look weird at the profile level, but lots of humans are unpopular and have the Twitter egg for a profile picture. How about what they’re saying? To incorporate the tweet data into the classifier, it needed to be summarized into one row per account. One such summary metric is lexical diversity, the ratio of unique tokens to total tokens in a document. Lexical diversity ranges from 0 to 1, where 0 (by convention) indicates an empty document and 1 indicates that every word was used exactly once. You can think of it as a measure of lexical sophistication.

I used pandas to quickly and elegantly apply summary functions like lexical diversity to the tweets. First I combined all of each user’s tweets into one document and tokenized it, so I was left with a list of words. Then I removed punctuation and stopwords with NLTK.
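
The preprocessing itself is only a few lines of NLTK and pandas. Here is a sketch of that step, assuming the raw tweets sit in a tweets DataFrame with screen_name and text columns (again, illustrative names):

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires one-time downloads: nltk.download('punkt') and nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def tokenize(document):
    # Lowercase and tokenize, then drop punctuation and stopwords
    words = word_tokenize(document.lower())
    return [w for w in words if w not in stop_words and w not in punctuation]

# One document per account: join each user's tweets, then tokenize
documents = tweets.groupby('screen_name')['text'].apply(' '.join)
tokens = documents.apply(tokenize)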

pandas makes it super simple to apply custom functions over groups of data. With groupby I grouped the tweets by screen name and then applied the lexical diversity function to each user’s tweets. I love the simplicity and flexibility of this syntax, which makes it a breeze to group over any category and apply custom summary functions. For example, I could group by location or predicted gender and compute the lexical diversity of all those slices just by changing the grouping variable.

def lexical_diversity(text):
    # Ratio of unique tokens to total tokens; empty documents get 0
    if len(text) == 0:
        diversity = 0
    else:
        diversity = float(len(set(text))) / len(text)
    return diversity

# Easily compute summaries for each user!
grouped = tweets.groupby('screen_name')
diversity = grouped.apply(lexical_diversity)

Again these bots look strange. Humans have a beautiful, almost textbook normal distribution of diversities centered at 0.70. Bots, on the other hand, have more mass at the extremes, especially towards one. A lexical diversity of one means that every word in the document is unique, implying that bots are either not tweeting much or are tweeting random strings of text.
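
The comparison behind that description can be drawn with a pair of overlaid histograms. This is only a sketch, assuming the diversity Series from above (indexed by screen_name) and the hypothetical users frame carrying each account’s bot label:

import matplotlib.pyplot as plt

# Attach each account's lexical diversity to its bot label (aligning on screen_name)
labeled = users.set_index('screen_name').assign(diversity=diversity)

# Overlay the two distributions of lexical diversity
fig, ax = plt.subplots()
labeled[labeled.bot == 0].diversity.hist(ax=ax, bins=30, alpha=0.5, label='humans')
labeled[labeled.bot == 1].diversity.hist(ax=ax, bins=30, alpha=0.5, label='bots')
ax.set_xlabel('lexical diversity')
ax.legend()
plt.show()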

Model Development

I used scikit-learn, the premier machine learning module in Python, for model development and validation. My analysis plan went something like this: since I’m primarily interested in predictive accuracy, why not just try a few classification methods and see which one performs best? One of the strengths of scikit-learn is a clean and consistent API for constructing models and pipelines, which makes trying out several models simple.
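
The snippets below assume the labeled account-level data has already been split into train and test DataFrames, with features holding the column names used for prediction and y the training labels. A minimal sketch of that setup, continuing with the hypothetical labeled frame from earlier (the feature list shown is illustrative only):

from sklearn.model_selection import train_test_split

# Columns used for prediction; illustrative, the real feature set is larger
features = ['friends_count', 'followers_count', 'diversity']

# Hold out 30% of accounts for testing; `bot` is the 0/1 label
train, test = train_test_split(labeled, test_size=0.3, random_state=42)
y = train.bot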

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Naive Bayes
bayes = GaussianNB().fit(train[features], y)
bayes_predict = bayes.predict(test[features])

# Logistic regression
logistic = LogisticRegression().fit(train[features], y)
logistic_predict = logistic.predict(test[features])

# Random forest
rf = RandomForestClassifier().fit(train[features], y)
rf_predict = rf.predict(test[features])

I fit three classifiers: naive Bayes, logistic regression, and a random forest. You can see that the syntax for each classification method is identical. In the first line I fit the classifier, providing the features from the training set and the labels, y. Then it’s simple to generate predictions from the fitted model by passing in the features from the test set, and to view accuracy measures with a classification report.

from sklearn import metrics

# Classification metrics for each model
print(metrics.classification_report(test.bot, bayes_predict))
print(metrics.classification_report(test.bot, logistic_predict))
print(metrics.classification_report(test.bot, rf_predict))

Not surprisingly, the random forest performed best, with an overall precision of 0.90 versus 0.84 for naive Bayes and 0.87 for logistic regression. Amazingly, with an out-of-the-box classifier we are able to correctly identify bots 90% of the time. But can we do better? Yes, yes we can. It’s actually really easy to tune classifiers using GridSearchCV. GridSearchCV takes a classification method and a grid of parameter settings to explore, where the “grid” is just a dictionary keyed off of the model’s configurable parameters. What’s rad about GridSearchCV is that you can treat it just like the classification methods we saw previously: we can call .fit() and .predict() on it.

# construct parameter grid
param_grid = {'max_depth': [1, 3, 6, 9, 12, 15, None],
              'max_features': [1, 3, 6, 9, 12],
              'min_samples_split': [2, 3, 6, 9, 12, 15],
              'min_samples_leaf': [1, 3, 6, 9, 12, 15],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

from sklearn.model_selection import GridSearchCV

# fit the best classifier found by searching over the parameter grid
grid_search = GridSearchCV(RandomForestClassifier(),
                           param_grid=param_grid).fit(train[features], y)

# assess predictive accuracy
predict = grid_search.predict(test[features])
print(metrics.classification_report(test.bot, predict))

Aha, better precision! The simple tuning step resulted in a configuration that yielded a 2% increase in precision. Inspecting the variable importance plot for the tuned random forest yields few surprises: the numbers of friends and followers are the most important variables for predicting bot status.
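
Both the winning parameter settings and the refit forest are exposed on the fitted grid_search object, so the numbers behind that plot can be pulled out directly. A quick sketch:

import pandas as pd
import matplotlib.pyplot as plt

# The parameter combination that won the grid search
print(grid_search.best_params_)

# Feature importances from the refit random forest, largest first
importances = pd.Series(grid_search.best_estimator_.feature_importances_,
                        index=features).sort_values(ascending=False)
print(importances)

importances.plot(kind='barh')
plt.show()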
