An NLP Approach to Analyzing Twitter, Trump, and Profanity

Who swears more? Do Twitter users who mention Donald Trump swear more than those who mention Hillary Clinton? Let’s find out by taking a natural language processing (NLP) approach to analyzing tweets.



Step Two: Collecting Data

 
It’s time to call our script with the query ‘Donald Trump OR Trump’. It will grab tweets containing the terms ‘Donald Trump’ or ‘Trump’ and write them to a file called ‘Donald-Trump-OR-Trump.csv’ in your data folder.

python twitter_pull_data.py 'Donald Trump OR Trump'
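
If you're curious how a query turns into that file name, here is a minimal sketch. The helper name query_to_csv_path and the data_dir default are our own assumptions for illustration, not necessarily what twitter_pull_data.py does internally. It simply swaps spaces for hyphens, which matches the file name mentioned above.

```python
from pathlib import Path


def query_to_csv_path(query, data_dir="data"):
    """Build the CSV path for a search query (hypothetical helper)."""
    # 'Donald Trump OR Trump' -> 'Donald-Trump-OR-Trump.csv'
    file_name = query.replace(" ", "-") + ".csv"
    return Path(data_dir) / file_name


print(query_to_csv_path("Donald Trump OR Trump"))
```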


Try running the script again, but this time passing in ‘Hillary Clinton OR Hillary’ as the query.

With both CSV files in our data folder, we can now create a script called profanity_analysis.py.

Step Three: Data Preprocessing

 
In this next script, we’ll first clean up our dirty data by getting rid of emoticons, hashtags, RTs, and so on. Then, we’ll explore the English stop words and profanity algorithms.
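
As a rough idea of what that cleanup can look like, here is a minimal sketch using only Python's standard library. The function names, the regular expressions, and the tiny STOP_WORDS set are all assumptions made for illustration; a hosted stop words algorithm would use a much fuller list.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "for", "on"}


def clean_tweet(text):
    """Strip links, RT markers, mentions, hashtags, and emoticons from a tweet."""
    text = re.sub(r"https?://\S+", " ", text)   # links (removed first, before punctuation)
    text = re.sub(r"\bRT\b", " ", text)         # retweet marker
    text = re.sub(r"[@#]\w+", " ", text)        # @mentions and #hashtags
    text = re.sub(r"[^A-Za-z'\s]", " ", text)   # emoticons, punctuation, digits
    tokens = (t.strip("'") for t in text.lower().split())
    return [t for t in tokens if t]


def remove_stop_words(words):
    """Drop common English words that carry little meaning on their own."""
    return [w for w in words if w not in STOP_WORDS]


words = clean_tweet("RT @user: Trump is the WORST :( #election https://t.co/abc")
print(remove_stop_words(words))
```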

That’s it for cleaning up our tweets!

Step Four: Checking Tweets for Profanity

 
Now, we’ll check out the Profanity Detection algorithm and discover the swear words in our tweets. This algorithm does a basic string match against a list of around 340 words from noswearing.com. Check out the Profanity algorithm page to learn more about the details of the algorithm and how you can customize the word list by adding your own offensive words, since fun, new offensive colloquialisms are added to the English language every day. Don’t believe us? Just check out Urban Dictionary for some new favorites that have popped up.

The profanity function is fairly straightforward:

You’re simply passing in the list of words that have been cleaned of English stop words. We’ve joined them into a single corpus since we’re interested in the total profanity of all the tweets from our data, rather than the profanity of each tweet. Our function profanity() prints out both the result of the algorithm and the total number of swear words. At the time of this writing, the query ‘Donald Trump OR Trump’ returned 30 swear words and ‘Hillary Clinton OR Clinton’ returned 8.
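
To make the idea concrete without calling the hosted algorithm, here is a local stand-in that mirrors the steps described above: join the cleaned words into one corpus, match against a word list, and report per-word counts plus a total. The three-word PROFANITY set is a placeholder we made up, and the whole-word matching is our own simplification, so don't expect the same numbers as the real Profanity Detection algorithm.

```python
from collections import Counter

# Placeholder word list for illustration only (an assumption, not the real list).
PROFANITY = {"damn", "hell", "crap"}


def profanity(words, word_list=PROFANITY):
    """Count word-list matches across all tweets joined into one corpus."""
    corpus = " ".join(words)  # one corpus, since we want totals across tweets
    counts = Counter(w for w in corpus.split() if w in word_list)
    total = sum(counts.values())
    print(dict(counts))
    print("Total swear words:", total)
    return dict(counts), total


profanity(["damn", "trump", "hell", "damn"])
```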

When we pulled our Twitter data, we also grabbed the user_id and the retweet count. This is useful because you might want to do some light analysis to gauge whether a tweet is likely to be more or less popular given the amount of profanity it uses.
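
One simple way to start on that kind of analysis is to compare the average retweet count of tweets that contain a swear word with those that don't. The sketch below assumes each row is a (text, retweet_count) pair and reuses a made-up word list; the function name and row layout are hypothetical, and a real comparison would need far more data before reading anything into it.

```python
from statistics import mean


def mean_retweets_by_profanity(rows, word_list):
    """Average retweets for tweets with and without a word-list match."""
    profane, clean = [], []
    for text, retweets in rows:
        has_profanity = any(w in word_list for w in text.lower().split())
        (profane if has_profanity else clean).append(retweets)
    return {
        "profane": mean(profane) if profane else None,
        "clean": mean(clean) if clean else None,
    }


rows = [("damn that", 10), ("nice day", 2), ("what the hell", 4)]
print(mean_retweets_by_profanity(rows, {"damn", "hell"}))
```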

If you want to see this code in its entirety, check out our Sample-Apps on GitHub!

Next Steps

 
Be sure to check out our other NLP algorithms such as social sentiment analysis or LDA (tags). Microservices like AnalyzeTweets combine the previously mentioned algorithms with one that retrieves tweets. AnalyzeTweets returns the negative and positive sentiment of each tweet along with the negative and positive LDA of each tweet. There is no shortage of combinations you can create, whether to do quick exploratory analysis or to add algorithms such as profanity or Nudity Detection to your app to make sure your content is family friendly.

Enjoy exploring the platform and, as always, if you have any questions, feel free to reach out!

Bio: Stephanie Kim is a Developer Evangelist at Algorithmia.

Original. Reposted with permission.
