Using Machine Learning To Predict Gender
Here is an experiment from the CrowdFlower AI team: they took a user’s Twitter link color, profile description, and a single random tweet containing the word “and” or “the,” and tried to guess who’s behind the curtain.
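To make those three inputs concrete, here is a minimal sketch of how one account might be flattened into a bag of tokens. The function name `account_tokens` and the tokenizing rule are assumptions for illustration; the post doesn't say how the team actually prepared its data.

```python
import re

def account_tokens(link_color, description, tweet):
    """Flatten one account into a list of tokens (hypothetical featurization)."""
    text = f"{description} {tweet}".lower()
    # \w+ grabs words; [^\w\s] keeps each emoji or symbol (such as "@") as its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Treat the profile link color as one more token, normalized to lowercase hex.
    tokens.append(link_color.lower().lstrip("#"))
    return tokens

# Made-up example account; "\u2764" is the heart symbol.
tokens = account_tokens("#F5ABB5", "Psych major \u2764", "pizza and wrestling")
print(tokens)
```

Keeping the hex color and the emoji as ordinary tokens is what would let a heart or a shade of pink show up in the same ranking as a word.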
An added bonus of this section is that we learned our graphing program can handle emojis. We live in exciting times.
Like our male predictors, this one has a few surprises. For one, just look at how predictive that heart emoji is. It was the strongest predictor across all categories and it wasn’t even all that close. Not only that, a different heart emoji was the fifth strongest predictor. That, of course, isn’t to say that every female account we saw had one of those, but rather that if a heart appeared in a tweet or profile, our model was very confident that account belonged to a woman. As we did with the gents, here are a few others worth a comment:
- camgirl was the second leading predictor. Do not google that at work.
- psych is either the belated comeback of a really tubular ’90s slang word or there are a lot of women in our sample who are psych majors. We’re guessing the latter. Reluctantly.
- f5abb5 is a hex color. And yes, before you click that, it’s pink.
A word about anti-predictors
Our model also looks at data that appears in the set but is actually unlikely to correlate with a certain account type. In other words: what phrases, colors, etc. don’t appear in men’s or women’s accounts? For men, you’ll of course see a lot of data that appears as female predictors. But there were a few interesting additions:
“Feminist,” for example, was the second most anti-predictive piece of data, whereas it was fairly low on the list of female predictors. “Underground” is odd, but hey, let’s roll with it. And apparently, dudes dislike using smiley faces. Tres sad. :(
Now, a look at female anti-predictors:
The “@” symbol and “wrestling” are flipped here, but, if you’ll recall, were both male predictors. You’ll notice a few new hex codes up there, but yes, those are various shades of blue and black. Also: “pizza” is anti-predictive. Apparently dudes are super into pizza. We should all be super into pizza. Pizza is good and cool.
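For the curious, here is one simple way to rank predictors and anti-predictors. It uses a smoothed log-odds score, which is a swapped-in technique for illustration and not necessarily what our model does under the hood. The function name `log_odds`, the smoothing value `alpha`, and the tiny made-up accounts are all assumptions. A positive score means a token leans toward the target group, a negative score marks an anti-predictor.

```python
import math
from collections import Counter

def log_odds(target_docs, other_docs, alpha=1.0):
    """Smoothed log-odds that a token shows up in target accounts vs. the rest."""
    # Count how many accounts in each group contain the token at least once.
    target = Counter(tok for doc in target_docs for tok in set(doc))
    other = Counter(tok for doc in other_docs for tok in set(doc))
    scores = {}
    for tok in set(target) | set(other):
        # Additive smoothing keeps both probabilities strictly between 0 and 1.
        p_t = (target[tok] + alpha) / (len(target_docs) + 2 * alpha)
        p_o = (other[tok] + alpha) / (len(other_docs) + 2 * alpha)
        scores[tok] = math.log(p_t / (1 - p_t)) - math.log(p_o / (1 - p_o))
    return scores

HEART = "\u2764"
# Toy, made-up accounts: each one is just a list of tokens.
female = [[HEART, "psych", "f5abb5"], [HEART, "coffee"], ["feminist", HEART]]
male = [["wrestling", "@", "pizza"], ["pizza", "coffee"], ["wrestling", "@"]]

scores = log_odds(female, male)
ranked = sorted(scores, key=scores.get, reverse=True)
print("strongest predictor:", ranked[0])
print("anti-predictors:", [tok for tok in ranked if scores[tok] < 0])
```

In this toy sample the heart lands on top for the female group, while “pizza,” “wrestling,” and “@” fall below zero. Swap the two arguments and the same function ranks the male side.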
What about non-individuals?
Of course, Twitter isn’t peopled solely by, uh, people. There are also a whole host of brand accounts, media sources, bots, and so on. Ever conscious of blasting out too many graphs, we’ll skip the predictors, because they weren’t wholly exciting. We saw words like “official,” “reddit,” “worldwide,” “newspaper,” and “association.” Mostly, that’s expected. Those are words you’d expect to see in a brand account. But we did want to lift the hood on some of the anti-predictors:
Again, these are words that suggest an account belongs to a real human. It makes sense to see jobs in there like “strategist” and “writer.” “I” was another interesting finding that, when you stop to think about it for a quick second, makes a ton of sense. Non-individuals are also not “passionate,” nor do they much write about “vegetables.” This nice person (@passionatevegan) agrees with these findings.
In the end, the model is only about 60% confident it can look at an account, complete with link color, description, and a single random tweet with the word “and” or “the” in it, and guess who’s behind the curtain. That makes sense. After all, we’re not all that different. We use a lot of the same words. But in the end, we learned an important update on John Gray: Men are from wrestling, women are from heart emoji.