Bot or Not: an end-to-end data analysis in Python

Twitter bots are programs that compose and post tweets without human intervention, and they range widely in complexity. Here we build a classifier with pandas, NLTK, and scikit-learn to identify Twitter bots.

Still, we need better tools for iterative model development

There’s still a lot of room for growth in scikit-learn, particularly in functions for generating model diagnostics and utilities for model comparison. As an illustrative example of what I mean, I want to take you away to another world where the language isn’t Python, it’s R. And there’s no scikit-learn, there’s only caret. Let me show you some of the strengths of caret that could be replicated in scikit-learn.

Below is the output from the confusionMatrix function, the conceptual equivalent of scikit-learn's classification_report. What you'll notice about the output of confusionMatrix is the depth of accuracy reporting. There's the confusion matrix itself, plus many accuracy measures computed from it. Most of the time you'll probably only use one or two of these measures, but it's nice to have them all available so you can use whatever works best for your situation without writing extra code.

> confusionMatrix(logistic_predictions, test$bot)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 394  22
         1 144  70

               Accuracy : 0.7365          
                 95% CI : (0.7003, 0.7705)
    No Information Rate : 0.854           
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.3183          
 Mcnemar's Test P-Value :
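For comparison, here is a rough sketch of assembling caret-style accuracy reporting by hand in scikit-learn. The labels below are toy data standing in for the bot dataset; the point is that each of caret's headline numbers has a scikit-learn (or one-line NumPy) counterpart, you just have to stitch them together yourself.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score, confusion_matrix)

# Toy labels standing in for the test set; replace with real predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)         # rows are truth, columns are prediction
acc = accuracy_score(y_true, y_pred)          # caret's "Accuracy"
kappa = cohen_kappa_score(y_true, y_pred)     # caret's "Kappa"
nir = max(np.bincount(y_true)) / len(y_true)  # caret's "No Information Rate"

print(cm)
print(f"Accuracy: {acc:.4f}  Kappa: {kappa:.4f}  NIR: {nir:.4f}")
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```

This covers the basics, but confidence intervals and the accuracy-versus-NIR test still require extra code (e.g. statsmodels' proportion utilities), which is exactly the gap caret fills.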

One of the biggest strengths of caret is the ability to extract inferential model diagnostics, something that's virtually impossible to do with scikit-learn. When fitting a regression model, for example, you'll naturally want to view coefficients, test statistics, p-values, and goodness-of-fit metrics. Even if you're only interested in predictive accuracy, there's value in understanding what the model is actually saying and in knowing whether the assumptions of the method are met. To replicate this type of output in Python, you'd have to refit the model in something like statsmodels, which makes the model development process wasteful and tedious.



Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2620  -0.6323  -0.4834  -0.0610   6.0228  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -5.7136     0.7293  -7.835 4.71e-15 ***
statuses_count   -2.4120     0.5026  -4.799 1.59e-06 ***
friends_count    30.8238     3.2536   9.474  < 2e-16 ***
followers_count -69.4496    10.7190  -6.479 9.22e-11 ***

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2172.3  on 2521  degrees of freedom
Residual deviance: 1858.3  on 2518  degrees of freedom
AIC: 1866.3

Number of Fisher Scoring iterations: 13

But I think the best feature of R's caret package is the ease with which you can compare models. Using the resamples function, I can quickly generate visualizations to compare model performance on metrics of my choosing. These types of utility functions are super useful during model development, but also when communicating early results, where you don't want to spend a ton of time making finalized figures.

# compare models
results = resamples(list(tree_model = tree_model, 
                         bagged_model = bagged_model,
                         boost_model = boost_model))
# plot results
dotplot(results)  # lattice dot plot of the resampled performance metrics
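A rough scikit-learn analogue of resamples is to run the same cross-validation folds over several models and collect the per-fold scores yourself. The models and data below are stand-ins, not the article's bot classifier; the shared `KFold` object is what makes the comparison fair, since every model is scored on identical splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # shared folds, like resamples

models = {
    "tree_model": DecisionTreeClassifier(random_state=0),
    "bagged_model": BaggingClassifier(random_state=0),
    "boost_model": GradientBoostingClassifier(random_state=0),
}
# One array of per-fold accuracies per model, all on the same splits.
scores = {name: cross_val_score(m, X, y, cv=cv) for name, m in models.items()}

for name, s in scores.items():
    print(f"{name:>12}: mean={s.mean():.3f} sd={s.std():.3f}")
```

From here you could boxplot the score arrays with matplotlib, but caret's one-liner plots of a resamples object remain noticeably less work.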
For me, these features make all the difference and are a huge part of why R is still my preferred language for model development.


If you learned anything from this read, I hope it's that Python is an extremely powerful tool for data tasks. We were able to retrieve data through an API, clean and process the data, and develop and test a classifier, all with Python. We've also seen that there's room for improvement. Utilities for fast, iterative model development are rich in R's caret package, and caret serves as a great model for future development in scikit-learn.

Bio: Erin Shellman is a statistician and programmer working as a research scientist at Amazon Web Services – S3. Before joining AWS, she was a Data Scientist in the Nordstrom Data Lab, where she worked in the area of personalization, building product recommendations for