Exploring Social Media Diversity with Natural Language Processing

This post uses natural language processing on Twitter data to determine the diversity of Twitter accounts the author is following. An innovative take on social media analytics.



Stage 2: Sorting the nouns into wheels

 
Once we have the 50 most commonly occurring nouns (excluding those on the black-list), we want to sort them into their respective diversity wheels.
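
As a rough sketch of this step, assuming the nouns have already been pulled out of the profiles by a part-of-speech tagger, the counting and filtering could look something like this. The function name `top_nouns`, the contents of `BLACKLIST` and the sample data are illustrative and not taken from the original code.

```python
from collections import Counter

# Hypothetical black-list: nouns too generic to say anything about a person.
BLACKLIST = {"things", "views", "opinions", "tweets"}

def top_nouns(nouns, limit=50, blacklist=BLACKLIST):
    """Return the `limit` most common nouns, ignoring black-listed words."""
    counts = Counter(
        noun.lower() for noun in nouns if noun.lower() not in blacklist
    )
    return counts.most_common(limit)

sample = ["Engineer", "views", "developer", "engineer", "father", "opinions"]
print(top_nouns(sample, limit=2))
```

Lower-casing before counting means ‘Engineer’ and ‘engineer’ are treated as the same noun.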

Now for each of these words we build up a synset. A synset is a set of synonyms that share a common meaning. So for ‘technology’, the synset includes the nouns ‘technology’ and ‘engineering’.
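
To make the idea concrete, here is a minimal sketch that uses a hand-written table in place of WordNet, so it runs without any corpus download. Only the ‘technology’ entry comes from the example above; the other entries and the function name `synonyms` are made up for illustration. With NLTK and its WordNet corpus installed, the real lookup would go through `nltk.corpus.wordnet.synsets(word, pos=wordnet.NOUN)`.

```python
# Toy stand-in for WordNet: each entry is one synset, i.e. a group of nouns
# that share a meaning. Only the first entry is taken from the post.
TOY_SYNSETS = [
    {"technology", "engineering"},
    {"father", "dad"},
    {"priest", "clergyman"},
]

def synonyms(word, synsets=TOY_SYNSETS):
    """Collect every noun that shares at least one synset with `word`."""
    found = set()
    for synset in synsets:
        if word in synset:
            found |= synset
    return found

print(sorted(synonyms("technology")))  # ['engineering', 'technology']
```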

I compiled a list of professions, sorted by similarity, so an oceanographer is next to a meteorologist but far from a mechanic. These professions then form a circle like this:

Professions circle

If I then match my social network to those professions and plot the weighted matches as a polar diagram, I can see how my network is distributed around the circle.

Profession distribution chart
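
One way to turn the matches into the data behind a polar chart is sketched below. It assumes each profession gets an evenly spaced angle around the circle and a weight equal to its share of all matches. The wheel contents, the counts and the function name `polar_distribution` are hypothetical.

```python
import math

# Hypothetical wheel: professions listed in similarity order around the circle.
WHEEL = ["oceanographer", "meteorologist", "doctor", "clergy", "engineer", "mechanic"]

def polar_distribution(matches, wheel=WHEEL):
    """Map each profession to (angle in radians, share of all matches)."""
    total = sum(matches.get(p, 0) for p in wheel)
    step = 2 * math.pi / len(wheel)
    return {
        p: (i * step, matches.get(p, 0) / total if total else 0.0)
        for i, p in enumerate(wheel)
    }

# Made-up match counts, for illustration only.
matches = {"engineer": 6, "clergy": 3, "doctor": 1}
for profession, (angle, share) in polar_distribution(matches).items():
    print(f"{profession:15s} {math.degrees(angle):6.1f} deg  {share:.0%}")
```

The (angle, share) pairs can be handed to any plotting library that supports polar axes.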

OK, so I follow a lot of CTOs, CEOs, engineers, developers, directors and clergy. Wait, what? I follow a lot of clergy?

For each profession I listed related terms; for clergy we have clergy, vicar, rector and priest. When I debugged the algorithm, it was finding a lot of uses of the word ‘father’, and a hypernym for father is of course priest. So my algorithm needs a little work.
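
The false positive is easy to reproduce in miniature. The sketch below uses a hand-written hypernym table rather than WordNet, and the `use_hypernyms` switch is only one possible mitigation, not what the original code does. All names here are illustrative.

```python
# Hand-written stand-in for WordNet hypernym links (word -> more general terms).
# The "father" -> "priest" link mirrors the case described above; the rest is
# made up for illustration.
TOY_HYPERNYMS = {
    "father": ["parent", "priest"],
    "vicar": ["clergyman"],
}

CLERGY_TERMS = {"clergy", "vicar", "rector", "priest"}

def matches(word, terms, hypernyms=TOY_HYPERNYMS, use_hypernyms=True):
    """True if `word`, or optionally one of its direct hypernyms, is in `terms`."""
    if word in terms:
        return True
    if use_hypernyms:
        return any(h in terms for h in hypernyms.get(word, []))
    return False

print(matches("father", CLERGY_TERMS))                       # True: the false positive
print(matches("father", CLERGY_TERMS, use_hypernyms=False))  # False
print(matches("vicar", CLERGY_TERMS, use_hypernyms=False))   # True
```

Switching hypernyms off removes the ‘father’ match but would also lose any genuine matches that rely on them, so it is a trade-off rather than a fix.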

Stage 3: Tipping the scales

 
Now that I have the profession distribution, I build up other diversity wheels for age and gender. I considered sexual orientation, but it was unreliable because it is not something people generally use to describe themselves in a profile.

Imagine each of these wheels in a series. They are interconnected, since a single person’s diversity reflects many aspects.

Colorful pinwheel

To create a diverse social network we want to rotate each of those wheels away from its current position. Think of the combination lock on your bike: you don’t just turn the numbers one notch away from your key combination, you turn each one (hopefully) randomly to make it impossible for someone to guess the original combination.

Reconsider the profession wheel. I don’t want the algorithm to suggest only one profession, based on the single statistical point furthest from my median, because that still wouldn’t be diverse. I want a collection of the professions that are furthest away from my social network, so that I can expand it. Those are (according to my charts) doctors, physicians, dentists and physiotherapists.
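
A possible way to pick that collection is sketched below. As an assumption, it uses a weighted circular mean as the centre of the network, rather than the median mentioned above, and ranks professions by their angular distance from it. The wheel, the weights and the function name `furthest_professions` are invented for the example.

```python
import math

def furthest_professions(wheel, weights, k=4):
    """Return the k professions whose angle is furthest from the weighted centre."""
    step = 2 * math.pi / len(wheel)
    angles = {p: i * step for i, p in enumerate(wheel)}
    # Weighted circular mean: sum the weighted unit vectors, then take the direction.
    x = sum(weights.get(p, 0) * math.cos(a) for p, a in angles.items())
    y = sum(weights.get(p, 0) * math.sin(a) for p, a in angles.items())
    centre = math.atan2(y, x)

    def distance(p):
        # Shortest way around the circle between this profession and the centre.
        diff = abs(angles[p] - centre) % (2 * math.pi)
        return min(diff, 2 * math.pi - diff)

    return sorted(wheel, key=distance, reverse=True)[:k]

# Made-up wheel and follow counts, for illustration only.
wheel = ["engineer", "developer", "CTO", "CEO",
         "doctor", "physician", "dentist", "physiotherapist"]
weights = {"engineer": 5, "developer": 5, "CTO": 3, "CEO": 3}
print(furthest_professions(wheel, weights))
```

Raising or lowering `k` widens or narrows the range of suggestions.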

Gender distribution

According to my gender distribution, I need to follow a lot more women and those who identify as non-binary.

All of this was done in less than 200 lines of Python.

What does this mean?

 
I would love to see research like this picked up by the social media engines. Once I have the outputs from my algorithm, I’m still constrained by Twitter’s keyword search feature and the rate-limiting APIs.

Twitter has some characteristics that impact natural language processing. The most obvious is that the character limit forces non-natural language. People over-punctuate, use abbreviations and rarely use prepositions. It was a lot of trial and error to get NLTK to do what I wanted.
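
Some of that trial and error can be reduced by cleaning the text before tagging it. The sketch below is one hedged example of such a clean-up step using only the standard library. The abbreviation table and the function name `normalise` are hypothetical, and this is not the preprocessing used in the original code.

```python
import re

# Hypothetical abbreviation table; a real one would need to be much larger.
ABBREVIATIONS = {"dev": "developer", "eng": "engineer", "mgr": "manager"}

def normalise(text):
    """Clean a tweet or profile bio before handing it to a tokenizer or tagger."""
    text = re.sub(r"https?://\S+", " ", text)    # drop links
    text = re.sub(r"[@#](\w+)", r"\1", text)     # keep the word, drop @ and #
    text = re.sub(r"([!?.,])\1+", r"\1", text)   # "!!!" -> "!"
    text = re.sub(r"\s*[|/]\s*", ", ", text)     # bio separators -> commas
    text = re.sub(                               # expand known abbreviations
        r"\w+", lambda m: ABBREVIATIONS.get(m.group().lower(), m.group()), text
    )
    return " ".join(text.split())                # collapse whitespace

print(normalise("Dev | father!!! | views my own https://example.com #python"))
# developer, father!, views my own python
```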

Looking at Twitter you can see how much of an issue this is. We should have a “who [not] to follow” feature that reverses the suggestion list and takes diversity into account.

Twitter

What next

 
I’m going to develop the algorithm to implicitly infer location, gender and age then build out the diversity wheels better for those aspects.

I’m going to measure the central point of the diversity wheels and their interconnections.

I’m going to offer a range of suggestions based on varying diversity, instead of simply assuming the opposite.

Bio: Anthony Shaw is Head of Innovation at Dimension Data. He leads research initiatives on DevOps, automation, cloud, machine learning and IoT as well as promoting innovation, free-thinking and global collaboration programmes. Anthony is a member of both the Apache Software Foundation and the Python Software Foundation and makes thousands of contributions yearly to support open-source through his work and research.

Original. Reposted with permission.
