Exploring Social Media Diversity with Natural Language Processing

This post uses natural language processing on Twitter data to determine the diversity of Twitter accounts the author is following. An innovative take on social media analytics.

By Anthony Shaw, Dimension Data.

It’s early afternoon. I have 4 hours until my flight from SFO to LAX, then a 2-hour layover before my 15-hour flight from LAX to Sydney. I have time to burn.

I decided to set myself a little challenge: can I develop an algorithm that finds people on social media to follow who are not like me? That was last week. I’m still going now that I’m back in Sydney, and it’s still keeping me up at night. It turns out this is quite a complex problem to solve.

I was inspired by 2 episodes of one of my favorite podcasts, “Talk Python to Me”: one on data science and language processing, and one on diversity.

Twitter’s recommendation algorithm

Twitter has a recommendation algorithm. It looks at who you follow, who they follow, and any common keywords to suggest other people with similar friends and interests. Most social media tools take a similar approach.
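To make the "who you follow, who they follow" idea concrete, here is a minimal friends-of-friends sketch. It is not Twitter's actual algorithm, which is not described in this post. The function name `suggest_accounts` and the toy `follows` graph are hypothetical, made up purely for illustration.

```python
from collections import Counter


def suggest_accounts(follows, user, top_n=3):
    """Rank accounts followed by the people `user` already follows."""
    already = follows.get(user, set())
    scores = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            # Skip the user and anyone the user already follows.
            if candidate != user and candidate not in already:
                scores[candidate] += 1
    return [name for name, _ in scores.most_common(top_n)]


# Hypothetical toy graph: account -> set of accounts it follows.
follows = {
    "me": {"alice", "bob"},
    "alice": {"bob", "carol", "dave"},
    "bob": {"carol", "me"},
}

print(suggest_accounts(follows, "me"))  # ['carol', 'dave']
```

Every suggestion comes from inside the existing circle, which is why an approach like this tends to surface more of the same kind of people.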

Twitter graph

This is really helpful: it allows you to build up a lot of followers (or followees) in your world. But there is one fatal flaw in the algorithm.

It assumes you want to follow people who are all more or less the same. I realized that the 1,000 people I am following on Twitter are guys in their 20s and 30s, with kids, interested in cloud, technology, programming and Python. What am I really going to learn from them? Sure, I’ll polish my technical skills and find out about the latest cool new utility or project going around. But it won’t expand my world view on anything, and I’ll become a more narrow-minded individual.

I initially thought I could just work out the opposite (antonym) of my profile and search for that. It’s not that simple! What is the opposite of ‘enthusiast’, and does that even make sense? What is the opposite of ‘developer’? Any answer would be subjective.

The diversity wheel

Diversity wheel

Diversity is complex. This illustration of a wheel shows some of the vectors you can consider when assessing diversity.

I’m going to focus on the ones that I think people are likely to share on social media.

People don’t have profiles that say “I’m a right-wing, 32-year-old straight woman with 2 kids, earning $35,000/yr as a tax accountant.”

Profiles can be cryptic or irrelevant. Robert Downey Jr’s is simply “You know who I am.”

Stage 1: The noun cloud

The first phase of the algorithm is to collect the profiles of all the people I follow, then match them to certain “diversity wheels”. The first wheel I experimented with was Profession, since this was the most likely to be shared in a profile.

I’m using the Python library NLTK (Natural Language Toolkit) for this analysis. The code is open and available at https://github.com/tonybaloney/wntf

I thought I would try to characterize the words that I care about (nouns) and group them to establish patterns in my social circles.

The first thing to look at is the nouns in my followers’ descriptions, starting with the most common ones. For me, these are:

{'NN': [('https', 80), ('cloud', 56), ('@', 39), ('technology', 36), ('http', 31), ('software', 30), ('developer', 28), ('business', 28), ('world', 26), ('father', 24), ('news', 23), ('fan', 21), ('account', 20), ('source', 20), ('husband', 20), ('enthusiast', 18), ('team', 18), ('web', 18), ('geek', 17), ('code', 17)], }
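As a rough sketch of how counts like these could be produced (not necessarily how the wntf repository does it), the helper below counts words per part-of-speech tag. The names `most_common_by_tag` and `tag_descriptions` are hypothetical. The tagging step relies on NLTK's `word_tokenize` and `pos_tag`, which need NLTK and its tokenizer and tagger data installed, so the sample at the bottom uses hand-tagged tokens instead.

```python
from collections import Counter, defaultdict


def most_common_by_tag(tagged_tokens, tags=("NN", "NNP"), top_n=20):
    """Count words per POS tag and keep the top_n most frequent for each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        if tag in tags:
            counts[tag][word] += 1
    return {tag: counter.most_common(top_n) for tag, counter in counts.items()}


def tag_descriptions(descriptions):
    """Tokenize and POS-tag each profile description with NLTK."""
    import nltk  # imported lazily so the counting helper works without NLTK

    tagged = []
    for text in descriptions:
        tagged.extend(nltk.pos_tag(nltk.word_tokenize(text)))
    return tagged


# Hand-tagged sample standing in for tag_descriptions(...) output.
sample = [
    ("cloud", "NN"),
    ("Python", "NNP"),
    ("developer", "NN"),
    ("cloud", "NN"),
    ("and", "CC"),
]
result = most_common_by_tag(sample)
print(result)  # {'NN': [('cloud', 2), ('developer', 1)], 'NNP': [('Python', 1)]}
```

With real data, the call would look like `most_common_by_tag(tag_descriptions(profile_texts))`, where `profile_texts` is an assumed list of profile description strings.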

So what can we tell about those nouns?

We then filter out certain nouns that commonly occur, such as ‘tweet’, ‘views’ and ‘opinions’, since a lot of people have a statement about their views not representing their employer.
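A minimal sketch of that filtering step follows. The words ‘tweet’, ‘views’ and ‘opinions’ come from the text above; adding ‘https’, ‘http’ and ‘@’ to the noise list is my assumption, and `NOISE` and `drop_noise` are hypothetical names.

```python
# 'tweet', 'views', 'opinions' are from the post; the URL and '@' tokens are an assumption.
NOISE = {"https", "http", "@", "tweet", "views", "opinions"}


def drop_noise(counts, noise=NOISE):
    """Remove (word, count) pairs whose word is in the noise list."""
    return [(word, n) for word, n in counts if word.lower() not in noise]


top_nouns = [
    ("https", 80), ("cloud", 56), ("@", 39),
    ("technology", 36), ("http", 31), ("software", 30),
]
filtered = drop_noise(top_nouns)
print(filtered)  # [('cloud', 56), ('technology', 36), ('software', 30)]
```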

Once that list is filtered, I can group my followers’ characteristics into a few traits:

  • Their industry: ‘business’, ‘technology’
  • Their role: ‘developer’
  • Their gender: ‘husband’, ‘father’
  • Their interests: ‘web’, ‘code’, ‘software’
  • The way they describe themselves: ‘geek’, ‘enthusiast’
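The grouping above was done by eye, but it could be sketched as a simple keyword lookup. The `TRAITS` mapping below just restates the bullet list, and `group_by_trait` is a hypothetical helper, not part of the original code.

```python
# Trait -> keywords, copied from the list above (an illustrative mapping only).
TRAITS = {
    "industry": {"business", "technology"},
    "role": {"developer"},
    "gender": {"husband", "father"},
    "interests": {"web", "code", "software"},
    "self-description": {"geek", "enthusiast"},
}


def group_by_trait(counts, traits=TRAITS):
    """Assign each (word, count) pair to every trait whose keywords contain it."""
    grouped = {trait: [] for trait in traits}
    for word, n in counts:
        for trait, keywords in traits.items():
            if word in keywords:
                grouped[trait].append((word, n))
    return grouped


nouns = [
    ("cloud", 56), ("technology", 36), ("software", 30), ("developer", 28),
    ("business", 28), ("father", 24), ("husband", 20), ("enthusiast", 18),
    ("web", 18), ("geek", 17), ("code", 17),
]
grouped = group_by_trait(nouns)
for trait, words in grouped.items():
    print(trait, words)
```

Words that match no trait, such as ‘cloud’ here, are simply left out, so a real version would need a much larger keyword list or a smarter matching step.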

Looking at the proper nouns (NNP), I can also get some other interesting information:

'NNP': [('@', 313), ('Cloud', 92), ('|', 74), ('Data', 63), ('IT', 44), ('Dimension', 39), ('Software', 36), ('Microsoft', 35), ('Director', 35), ('Python', 32), ('Manager', 31), ('Husband', 26), ('Developer', 25), ('CTO', 25), ('Architect', 25), ('CEO', 24), ('Engineer', 24), ('/', 24), ('Technology', 23), ('Dad', 23)],

Again, filtering out some of the fluff, like @ and /, leaves:

  • Company data: ‘Microsoft’, ‘Dimension’ (Data)
  • Role: ‘CTO’, ‘CEO’, ‘Engineer’, ‘Architect’

My noun cloud