Bot or Not: an end-to-end data analysis in Python
Twitter bots are programs that compose and post tweets without human intervention, and they range widely in complexity. Here we are building a classifier with pandas, NLTK, and scikit-learn to identify Twitter bots.
Still, we need better tools for iterative model development
There’s still a lot of room for growth in scikit-learn, particularly in functions for generating model diagnostics and utilities for model comparison. As an illustrative example of what I mean, I want to take you away to another world where the language isn’t Python, it’s R. And there’s no scikit-learn, there’s only caret. Let me show you some of the strengths of caret that could be replicated in scikit-learn.
Below is the output from the confusionMatrix function, the conceptual equivalent of scikit-learn’s classification_report. What you’ll notice about the output of confusionMatrix is the depth of accuracy reporting. There’s the confusion matrix itself, plus a long list of accuracy measures computed from it. Most of the time you’ll probably only use one or two of the measures, but it’s nice to have them all available so that you can use what works best in your situation without having to write extra code.
    > confusionMatrix(logistic_predictions, test$bot)
    Confusion Matrix and Statistics

              Reference
    Prediction   0   1
             0 394  22
             1 144  70

                   Accuracy : 0.7365
                     95% CI : (0.7003, 0.7705)
        No Information Rate : 0.854
        P-Value [Acc > NIR] : 1

                      Kappa : 0.3183
     Mcnemar's Test P-Value :
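To make the "extra code" concrete, here is a minimal Python sketch of how a few of those measures fall out of the four cell counts. The counts are copied from the confusionMatrix output above; the function name `accuracy_report` and its dictionary layout are my own illustrative choices, not part of caret or scikit-learn. Current scikit-learn releases expose the individual pieces (for example `confusion_matrix` and `cohen_kappa_score` in `sklearn.metrics`), but not as one combined report.

```python
# Cell counts copied from the confusionMatrix output above.
# Keys are (predicted label, reference label).
counts = {
    (0, 0): 394,
    (0, 1): 22,
    (1, 0): 144,
    (1, 1): 70,
}


def accuracy_report(counts):
    """Hypothetical helper: accuracy, no-information rate and Cohen's kappa."""
    labels = sorted({label for pair in counts for label in pair})
    total = sum(counts.values())

    # Share of cases on the diagonal (prediction equals reference)
    observed = sum(counts.get((k, k), 0) for k in labels) / total

    # Row totals (predictions) and column totals (reference classes)
    predicted_totals = {k: sum(counts.get((k, a), 0) for a in labels) for k in labels}
    actual_totals = {k: sum(counts.get((p, k), 0) for p in labels) for k in labels}

    # Accuracy obtained by always guessing the most common reference class
    no_information_rate = max(actual_totals.values()) / total

    # Agreement expected by chance, used by Cohen's kappa
    expected = sum(predicted_totals[k] * actual_totals[k] for k in labels) / total**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "accuracy": observed,
        "no_information_rate": no_information_rate,
        "kappa": kappa,
    }


report = accuracy_report(counts)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Rounded, the three values agree with the Accuracy, No Information Rate, and Kappa lines in the R output. Note that the accuracy sits below the no-information rate, which is why the reported P-Value [Acc > NIR] is 1.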
One of the biggest strengths of caret is the ability to extract inferential model diagnostics, something that’s virtually impossible to do with scikit-learn. When fitting a regression method, for example, you’ll naturally want to view coefficients, test statistics, p-values, and goodness-of-fit metrics. Even if you’re only interested in predictive accuracy, there’s value in understanding what the model is actually saying and knowing whether the assumptions of the method are met. Replicating this type of output in Python would require refitting the model in something like statsmodels, which makes the model development process wasteful and tedious.
    summary(logistic_model)

    Call:
    NULL

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.2620  -0.6323  -0.4834  -0.0610   6.0228

    Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
    (Intercept)       -5.7136     0.7293  -7.835 4.71e-15 ***
    statuses_count    -2.4120     0.5026  -4.799 1.59e-06 ***
    friends_count     30.8238     3.2536   9.474  < 2e-16 ***
    followers_count  -69.4496    10.7190  -6.479 9.22e-11 ***
    ---
    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 2172.3  on 2521  degrees of freedom
    Residual deviance: 1858.3  on 2518  degrees of freedom
    AIC: 1866.3

    Number of Fisher Scoring iterations: 13
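If you haven’t read one of these summaries before, the test statistics are simple functions of the estimates. The Python sketch below recomputes the z values, two-sided p-values, and AIC from the numbers printed above, using only the standard library. The helper name `wald_test` is hypothetical, and the inputs are the rounded figures from the summary, so the results only agree with R to within rounding.

```python
from math import erfc, sqrt

# (estimate, standard error) pairs copied from the summary output above
coefficients = {
    "(Intercept)": (-5.7136, 0.7293),
    "statuses_count": (-2.4120, 0.5026),
    "friends_count": (30.8238, 3.2536),
    "followers_count": (-69.4496, 10.7190),
}


def wald_test(estimate, std_error):
    """Hypothetical helper: Wald z statistic and two-sided normal p-value."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2))
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


wald = {name: wald_test(*pair) for name, pair in coefficients.items()}
for name, (z, p) in wald.items():
    print(f"{name:<16} z = {z:7.3f}   p = {p:.2e}")

# For a model fit to 0/1 outcomes, AIC = residual deviance + 2 * (number of coefficients)
residual_deviance = 1858.3
aic = residual_deviance + 2 * len(coefficients)
print(f"AIC = {aic:.1f}")
```

None of this is hard, but it is exactly the kind of bookkeeping that R hands you for free and that scikit-learn leaves to you.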
But I think the best feature of R’s caret package is the ease with which you can compare models. Using the resamples function I can quickly generate visualizations to compare model performance on metrics of my choosing. These types of utility functions are super useful during model development, but also when communicating early results, where you don’t want to spend a ton of time making finalized figures.
    # compare models
    results = resamples(list(tree_model = tree_model,
                             bagged_model = bagged_model,
                             boost_model = boost_model))

    # plot results
    dotplot(results)
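For comparison, here is a rough sketch of how a similar side-by-side comparison might be assembled by hand in scikit-learn. Everything here is an assumption for illustration: the data are synthetic (generated with `make_classification`, not the Twitter data), and the three estimators are just plausible stand-ins for the tree, bagged, and boosted models named in the R snippet. There is no plotting step, only a text summary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the bot data set from this article
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Assumed counterparts of the three R models
models = {
    "tree_model": DecisionTreeClassifier(random_state=0),
    "bagged_model": BaggingClassifier(random_state=0),
    "boost_model": GradientBoostingClassifier(random_state=0),
}

# Reuse one seeded splitter so every model is scored on the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

# Text summary of the per-fold accuracies for each model
for name, scores in results.items():
    print(f"{name:>12}: mean = {scores.mean():.3f}, sd = {scores.std():.3f}")
```

It works, but the collection, summary, and any plotting are all left to you, which is the gap that resamples and dotplot fill in caret.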
For me, these features make all the difference and are a huge part of why R is still my preferred language for model development.
If you learned anything from this read, I hope it’s that Python is an extremely powerful tool for data tasks. We were able to retrieve data through an API, clean and process the data, and develop and test a classifier, all with Python. We’ve also seen that there’s room for improvement. Utilities for fast, iterative model development are rich in R’s caret package, and caret serves as a great model for future development in scikit-learn.
Bio: Erin Shellman is a statistician + programmer working as a research scientist at Amazon Web Services – S3. Before joining AWS, she was a Data Scientist in the Nordstrom Data Lab, where she worked in the area of personalization, building product recommendations for Nordstrom.com.