Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint
Social media has been a gold mine for studying the way people communicate and behave. In this article I’ll show you the easiest way of analyzing tweets without the Twitter API, in a way that scales to Big Data.
Add sentiment directly to a Spark DataFrame
Transforming this code into Spark code is simple, and the same pattern can help you transform other code as well. Let’s start by importing the User Defined Function module from Spark:
Then we will transform the code from above to a function:
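Here is a minimal sketch of what such a function could look like. It assumes TextBlob as the sentiment library; the names `label_polarity` and `tweet_sentiment` and the 0.0/1.0/2.0 coding are illustrative choices, not something fixed by Spark or TextBlob.

```python
def label_polarity(polarity):
    """Map a polarity score in [-1, 1] to a numeric label.

    0.0 = neutral, 1.0 = positive, 2.0 = negative (illustrative coding).
    """
    if polarity == 0:
        return 0.0
    return 1.0 if polarity > 0 else 2.0


def tweet_sentiment(text):
    """Score one tweet with TextBlob and return its numeric label."""
    # Assumption: TextBlob is installed (pip install textblob). It is imported
    # here so that label_polarity stays usable without it.
    from textblob import TextBlob

    return label_polarity(TextBlob(text).sentiment.polarity)
```

Returning a float keeps the function compatible with a Spark `DoubleType` column later on.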
After that we will register the function as a Spark UDF:
Then to apply the function to the whole dataframe we need to write:
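The registration and application steps might look like the sketch below. The column names `tweet` and `sentiment`, and the helper names `null_safe` and `add_sentiment_column`, are assumptions for illustration; `score_fn` stands for any Python function that maps a string to a number.

```python
def null_safe(score_fn):
    """Wrap score_fn so null rows (None) give None instead of raising."""
    def wrapper(text):
        return None if text is None else float(score_fn(text))
    return wrapper


def add_sentiment_column(df, score_fn, text_col="tweet", out_col="sentiment"):
    """Return a new Spark DataFrame with score_fn applied to every row."""
    # Assumption: PySpark is installed. Imported here so that null_safe
    # can be used and tested without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # udf() turns the plain Python function into a Spark column expression
    sentiment_udf = udf(null_safe(score_fn), DoubleType())
    return df.withColumn(out_col, sentiment_udf(df[text_col]))
```

With a DataFrame `df` that has a `tweet` column, `add_sentiment_column(df, score_fn).show()` would print the tweets next to their scores. The null guard matters because Spark passes `None` to a Python UDF for missing values.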
And we will see:
Sentiment analysis, the good programmer way (Making the code modular)
The code above works, but it is not really quality code. Let’s turn it into functions so we can use it over and over.
The first part is setting up everything:
The last part is a class that will suppress the automatic printing of Twint so we just see the dataframe.
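A class like that can be written as a small context manager. This is a sketch using only the standard library; the name `HiddenPrints` is illustrative.

```python
import os
import sys


class HiddenPrints:
    """Context manager that silences print() output inside a with-block."""

    def __enter__(self):
        self._original_stdout = sys.stdout
        self._devnull = open(os.devnull, "w")
        sys.stdout = self._devnull
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore stdout even if the block raised, then release the file handle
        sys.stdout = self._original_stdout
        self._devnull.close()
        return False  # do not swallow exceptions


with HiddenPrints():
    print("this line is discarded")
print("this line is shown")
```

The standard library’s `contextlib.redirect_stdout` does the same job if you prefer not to write the class yourself.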
Everything from above can be summarized in these functions:
So to get the tweets and add sentiment we use:
And that’s it :)
Let’s see the distribution of the sentiments:
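One way to look at that distribution, sketched in plain Python on labels that have already been collected from the DataFrame. The 0.0/1.0/2.0 coding and the `sentiment` column name are assumptions; on the Spark side, `df.groupBy("sentiment").count()` gives the raw counts without collecting rows.

```python
from collections import Counter

# Assumed coding of the sentiment column (illustrative)
LABEL_NAMES = {0.0: "neutral", 1.0: "positive", 2.0: "negative"}


def sentiment_distribution(labels):
    """Return {label_name: share of tweets} for an iterable of numeric labels."""
    counts = Counter(LABEL_NAMES.get(label, "unknown") for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Made-up labels for illustration; with Spark you could pass
# [row["sentiment"] for row in df.select("sentiment").collect()]
shares = sentiment_distribution([1.0, 1.0, 0.0, 2.0])
for name, share in sorted(shares.items()):
    print(f"{name:>8}: {share:.0%}")
```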
Doing more with Twint
We can do more. Here I’ll show you how to create a simple function to get tweets, and also how to build a word cloud from them.
So to get the tweets from a simple search:
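A sketch of such a function, assuming Twint’s `Config` object and its pandas storage. The helper names `search_settings` and `get_tweets` are illustrative, and the location of the resulting DataFrame has changed between Twint releases, so treat the last line as an assumption to check against your installed version.

```python
def search_settings(search, limit=100):
    """Twint Config attributes for a keyword search, as a plain dict."""
    return {
        "Search": search,     # query string
        "Limit": limit,       # approximate cap on the number of tweets
        "Pandas": True,       # keep results in memory as a pandas DataFrame
        "Hide_output": True,  # do not print every tweet
    }


def get_tweets(search, limit=100):
    """Run the search and return the tweets as a pandas DataFrame."""
    # Assumption: twint is installed. Imported here so search_settings
    # can be inspected without it.
    import twint

    config = twint.Config()
    for name, value in search_settings(search, limit).items():
        setattr(config, name, value)
    twint.run.Search(config)
    # Assumption: recent Twint releases expose the result here; older
    # releases used twint.output.panda.Tweets_df instead.
    return twint.storage.panda.Tweets_df
```

A call such as `get_tweets("data science", limit=1000)` would then return one row per tweet.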
With this we can get thousands of tweets very easily:
To generate a word cloud this is all we need to do:
I added some stopwords that are common in tweets that don’t matter to the analysis. To show it we use:
And you’ll get:
Pretty, but not that much. Good code is modular, so let’s turn that into a function:
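A sketch of what that function could look like, assuming the `wordcloud` and `matplotlib` packages. The stopword list, the cleaning step and the names `clean_tweet` and `generate_word_cloud` are illustrative choices.

```python
import re

# Tokens that are frequent in tweets but carry no meaning (illustrative list)
TWEET_STOPWORDS = frozenset(
    {"https", "http", "co", "rt", "amp", "pic", "twitter", "com"}
)


def clean_tweet(text):
    """Drop URLs and @mentions so they do not dominate the cloud."""
    text = re.sub(r"https?://\S+|pic\.twitter\.com/\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return " ".join(text.split())


def generate_word_cloud(tweets, extra_stopwords=TWEET_STOPWORDS):
    """Build a WordCloud from an iterable of tweet strings and display it."""
    # Assumption: the wordcloud and matplotlib packages are installed.
    import matplotlib.pyplot as plt
    from wordcloud import STOPWORDS, WordCloud

    text = " ".join(clean_tweet(t) for t in tweets)
    cloud = WordCloud(
        stopwords=STOPWORDS | set(extra_stopwords),
        background_color="white",
        width=800,
        height=400,
    ).generate(text)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
    return cloud
```

Keeping the cleaning step separate makes it easy to reuse before sentiment scoring as well.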
And then we just run:
Try it yourself
There are many more things that you can do with the library. Some other functions:
twint.run.Search() - Fetch Tweets using the search filters (Normal);
twint.run.Followers() - Fetch a Twitter user's followers;
twint.run.Following() - Fetch who a Twitter user follows;
twint.run.Favorites() - Fetch Tweets a Twitter user has liked;
twint.run.Profile() - Fetch Tweets from a user's profile (Includes retweets);
twint.run.Lookup() - Fetch information from a user's profile (bio, location, etc.).
You can also use Twint from the terminal. For that, just run:
Then go to the twint folder:
And finally you can run for example:
Here I’m getting all the tweets (845 so far) that the TDS Team has posted this year. Here is the CSV file if you want it:
Bonus (scaling the results)
Let’s get 10k tweets and get their sentiment, because why not. For that:
This actually took almost 10 minutes, so plan accordingly. It may be faster to get the tweets from the CLI and then just apply the function. Let’s see how many tweets we have:
And we have 10,031 tweets with sentiment labels! You can use them for training other models too.
Thanks for reading this. Hopefully it can help you with your current job and your understanding of data science. If you want to know more about me, follow me on Twitter:
Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. He is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team and having interesting problems to solve. He is part of Apache Spark collaboration, helping in MLlib, Core and the Documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and automatic learning to help the world become a better place.
Original. Reposted with permission.