Using Machine Learning To Predict Gender
Here is an experiment from the CrowdFlower AI team, in which they took a Twitter account’s link color, its profile description, and a single random tweet containing the word “and” or “the” and guessed who’s behind the curtain.
By Justin Tenuto, CrowdFlower.
This all started with a simple question: could we train an algorithm to determine if a Twitter account belonged to a man or a woman? With that in mind, we ran a simple data categorization job, fired up our brand new CrowdFlower AI feature, and tried to answer just that. What we found was, well, pretty damn interesting. But no spoilers. We’ll get to all that in a second. Let’s take a step back and start at the beginning.
Here’s how we did it
To run any CrowdFlower job, you of course need data, in this case tweets. The first challenge with a question like this is deciding exactly what sort of tweets to pull. To put it another way: if you fetch social data about, say, an especially seedy strip club, odds are you’re going to get a few more male-authored tweets than female-authored ones. So we took our thumb off the scale and pulled 10,000 tweets with the word “the” in them and another 10,000 with the word “and” in them. But, importantly, we did a little something extra: in addition to a swath of random tweets, we also captured the user’s profile description (the “about me,” if you will), their profile image, and even the colors the accounts used for their links and sidebars.
With our data fetched, we ran a data categorization job where we asked our contributors to visit the profile pages of Twitter accounts and judge the gender of each. We had them bucket accounts into “male,” “female,” “brand or organization,” and gave them an option for “can’t tell” as well. Then, we ran the tweets through our AI feature.
And that’s where things got interesting.
We weren’t expecting the model to be super confident about its predictions–after all, each data row had just a single tweet, a profile, and some ancillary information to look at. But what we did manage to get were some major individual predictors (and anti-predictors) that strongly correlated with each account type. In other words: there are certain words, colors, and phrases that almost always mean an account belongs to a man or a woman. So how does CrowdFlower AI work? Here comes the science:
First, our machine learning feature looks at each data row (which in this case is a tweet, a profile, etc.) and the judgment our contributors made for each of those rows. Then, it looks for patterns. In accounts marked as male, which words come up most frequently? Which come up least frequently? And since we pulled the colors these accounts used for their links and sidebars, the model was able to look at hex codes and figure out which colors were most often associated with men, women, or brands. The model then assigns a value to how predictive a certain piece of data is. In effect–or at least for our purposes here–that shakes out to a sort of top twenty-five list of words that predict an account is run by a man or a woman.
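For the curious, here is a rough sketch of what "assigning a value to how predictive a piece of data is" could look like. To be clear, this is not CrowdFlower AI's actual method, which isn't spelled out here. It is a simple stand-in: a smoothed log-ratio of how often a token shows up in rows with one label versus rows with every other label. The function names (`token_scores`, `top_predictors`), the smoothing constant `alpha`, and the rows in `toy_rows` are all invented for illustration.

```python
import math
from collections import Counter

def token_scores(rows, target, alpha=1.0):
    """Score tokens by a smoothed log-ratio: share of `target` rows that
    contain the token vs. share of all other rows that contain it.
    Positive = predictor of `target`, negative = anti-predictor."""
    in_counts, out_counts = Counter(), Counter()
    n_in = n_out = 0
    for tokens, label in rows:
        unique = set(tokens)  # count each token once per row
        if label == target:
            in_counts.update(unique)
            n_in += 1
        else:
            out_counts.update(unique)
            n_out += 1
    scores = {}
    for tok in set(in_counts) | set(out_counts):
        # alpha is an assumed smoothing term so unseen tokens don't divide by zero
        p_in = (in_counts[tok] + alpha) / (n_in + 2 * alpha)
        p_out = (out_counts[tok] + alpha) / (n_out + 2 * alpha)
        scores[tok] = math.log(p_in / p_out)
    return scores

def top_predictors(rows, target, k=25):
    """Return the k highest-scoring tokens for `target`."""
    scores = token_scores(rows, target)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up rows: (tokens from tweet, profile, and link color; contributor label)
toy_rows = [
    (["wrestling", "the", "2fc2ef"], "male"),
    (["the", "and", "love"], "female"),
    (["wrestling", "and"], "male"),
    (["the", "sale"], "brand"),
]

print(top_predictors(toy_rows, "male", k=3))
```

Because words, hashtags, and hex codes are all just tokens in this sketch, a link color can land on the same ranked list as a word. Tokens with the most negative scores play the role of anti-predictors.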
And with that, the findings:
So what data is most predictive of a man’s Twitter account? As in, what word or phrase appears most often in men’s accounts and least often in other kinds? While a few of these we expected, we must say, number one was a bit of a surprise:
WRESTLING, BROTHER. Somewhere, the Macho Man is snapping into a Slim Jim and smiling approvingly.
A few words on some of the other predictors we found:
- thisiswhyweplay is an NBA hashtag. All hail Draymond Green.
- The @ symbol was a really interesting predictor. It suggests guys are more likely to talk to (or, in a lot of cases, talk at) another account than women or non-individual accounts are.
- 2fc2ef is a hex code, or, for those unfamiliar with that phrase, the way a computer describes color. And yes, 2fc2ef is blue.
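If you want to check the blue claim yourself, a hex code is just three two-digit base-16 numbers for red, green, and blue. Here is a minimal sketch of the conversion; the helper names `hex_to_rgb` and `dominant_channel` are made up for this example.

```python
def hex_to_rgb(code):
    """Convert a 6-digit hex color such as '2fc2ef' to an (r, g, b) tuple."""
    code = code.lstrip("#")
    if len(code) != 6:
        raise ValueError("expected a 6-digit hex color")
    # Each pair of hex digits is one channel in the range 0-255
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

def dominant_channel(code):
    """Name the channel with the highest value (a crude 'what color is this')."""
    r, g, b = hex_to_rgb(code)
    return max((("red", r), ("green", g), ("blue", b)), key=lambda pair: pair[1])[0]

print(hex_to_rgb("2fc2ef"))        # (47, 194, 239)
print(dominant_channel("2fc2ef"))  # blue
```

The green value is high too, which is why the color reads as a bright sky blue rather than a deep navy.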
We’ll get to the anti-predictors in a second, but since a fair share of them predict an account belongs to a woman, let’s look at that data first.