Bot or Not: an end-to-end data analysis in Python
Twitter bots are programs that compose and post tweets without human intervention, and they range widely in complexity. Here we build a classifier with pandas, NLTK, and scikit-learn to identify Twitter bots.
In this post I want to discuss an Internet phenomenon known as bots, specifically Twitter bots. I’m focusing on Twitter bots primarily because they’re fun and funny, but also because Twitter happens to provide a rich and comprehensive API that allows users to access information about the platform and how it’s used. In short, it makes for a compelling demonstration of Python’s prowess for data analysis work, and also of its areas of relative weakness.
For those unfamiliar with Twitter (who are you monsters?), it’s a social media platform on which users can post 140-character fart jokes called “tweets” (that joke bombed at PyData btw, but I can’t let it go). Twitter is distinct from other social media in that by default tweets are public and there’s no expectation that followers actually know one another. You can think of Twitter less as a stream of personal news, and more as a marketplace of ideas where the currency is favs and retweets.
Another distinguishing feature of Twitter is the “embed-ability” of its content (hilarious example above). It’s commonplace nowadays to see tweets included as part of news media. This is due in no small part to the openness of Twitter’s APIs, which allow developers to programmatically tweet and view timelines. But the same openness that makes Twitter pervasive across the internet also opens the door for unwelcome users, like bots.
Twitter bots are programs that compose and post tweets without human intervention, and they range widely in complexity. Some are relatively inert, living mostly to follow people and fav things, while others use sophisticated algorithms to create, at times, very convincing speech. All bots can be a nuisance because their participation in the Twittersphere undermines the credibility of Twitter’s analytics and marketing attribution, and ultimately their bottom line.
So what can Twitter do about them? Well, the first step is to identify them. Here’s how I did it.
The objective is to build a classifier to identify accounts likely belonging to bots, and I took a supervised learning approach. “Supervised” means we need labeled data, i.e. we need to know at the outset which accounts belong to bots and which belong to humans. In past work this thankless task had been accomplished through the use (and abuse) of grad students. For example, Jajodia et al manually inspected accounts and applied a Twitter version of the Turing test: if it looks like a bot, and tweets like a bot, then it’s a bot. The trouble is, I’m not a grad student anymore and my time has value (that joke killed). I solved this problem thanks to a hot tip from friend and co-worker Jim Vallandingham, who introduced me to fiverr, a website offering dubious services for $5.
Five dollars and 24 hours later, I had 5,500 new followers. Since I knew who followed me prior to the bot swarm, I could positively identify them as humans and all my overnight followers as bots.
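That before/after split is all the labeling logic needed. A minimal sketch of the idea in pandas (the variable names and toy ids below are my own, not from the original pipeline):

```python
import pandas as pd

# follower ids captured before and after the purchased bot swarm
# (toy ids here; in practice these come from the Twitter API)
before = {101, 102, 103}
after = {101, 102, 103, 901, 902, 903, 904}

labels = pd.DataFrame({"user_id": sorted(after)})
# anyone who followed me before the swarm is presumed human;
# everyone who appeared overnight is presumed bot
labels["bot"] = ~labels["user_id"].isin(sorted(before))
print(labels)
```

The presumption isn’t airtight (a human could have followed me overnight), but at 5,500-to-a-handful odds it’s good enough for a training set.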
Due to the richness of the Twitter REST API, creating the feature set required significantly less terms-of-service-violating behavior. I used the python-twitter module to query two endpoints: GET users/lookup and GET statuses/user_timeline. The users/lookup endpoint returns a JSON blob containing information you could expect to find on a user’s profile page, e.g. indicators of whether they’re using default profile settings, follower/following counts, and tweet count. From GET statuses/user_timeline I grabbed the last 200 tweets of everyone in my dataset.
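To give a flavor of what that blob looks like, here’s a sketch of pulling a few profile-level features out of a users/lookup-style record. The field names match Twitter’s documented user object; the sample values (and the ratio feature) are invented for illustration:

```python
# a trimmed, invented example of one record from GET users/lookup
user = {
    "id": 901,
    "screen_name": "definitely_not_a_bot",
    "default_profile": True,
    "default_profile_image": True,
    "followers_count": 2,
    "friends_count": 1017,
    "statuses_count": 3,
}

def profile_features(user):
    # simple profile-level features for the classifier
    return {
        "default_profile": user["default_profile"],
        "default_profile_image": user["default_profile_image"],
        "followers_count": user["followers_count"],
        "friends_count": user["friends_count"],
        "statuses_count": user["statuses_count"],
        # bots tend to follow far more accounts than follow them back
        "follower_friend_ratio": user["followers_count"] / max(user["friends_count"], 1),
    }

features = profile_features(user)
```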
The trouble is, Twitter isn’t going to let you just roll in and request all the data you want. They enforce rate limits on the API, which means you’re going to have to take a little cat nap in between requests. I accomplished this in part with the charming method, blow_chunks:
```python
# don't exceed API limits
def blow_chunks(self, data, max_chunk_size):
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]
```
blow_chunks takes as input a list of your queries, for example user ids, and breaks it into chunks of a maximum size. But it doesn’t return those chunks; it returns a generator, which can be used thusly:
```python
chunks = self.blow_chunks(user_ids, max_chunk_size=max_query_size)
while True:
    try:
        current_chunk = next(chunks)
        for user in current_chunk:
            try:
                user_data = self.api.GetUser(user_id=str(user))
                results.append(user_data.AsDict())
            except Exception:
                print("got a twitter error! D:")
        print("nap time. ZzZzZzzzzz...")
        time.sleep(60 * 16)
    except StopIteration:
        break
```
If the query size is bigger than the maximum allowed, break the queries into chunks. Calling next() on the generator grabs the first chunk, which is sent off to the API. Then grab a beer, because there are 16 minutes until the next request is sent. When there aren’t any more chunks left, the generator raises StopIteration and we break out of the loop.
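As an aside, a for loop consumes a generator and handles StopIteration implicitly, so the same pattern can be written more compactly. A sketch under my own names, with a stand-in fetch_one function and a configurable nap in place of the real python-twitter call and the hard-coded 16-minute sleep:

```python
import time

def blow_chunks(data, max_chunk_size):
    # same chunker as above, minus self
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]

def fetch_all(user_ids, fetch_one, max_query_size=100, nap_seconds=16 * 60):
    # fetch_one stands in for the real API call, e.g.
    # lambda u: api.GetUser(user_id=str(u)).AsDict()
    results = []
    for chunk in blow_chunks(user_ids, max_query_size):
        for user in chunk:
            try:
                results.append(fetch_one(user))
            except Exception:
                print("got a twitter error! D:")
        time.sleep(nap_seconds)  # respect the rate limit between chunks
    return results

# with a fake fetch_one and no nap, this runs instantly for testing
results = fetch_all([1, 2, 3, 4, 5], lambda u: {"id": u},
                    max_query_size=2, nap_seconds=0)
```

The for loop version trades the explicit StopIteration handling for less nesting; the behavior is the same.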