Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint
Social media has been a gold mine for studying how people communicate and behave. In this article I'll show you the easiest way of analyzing tweets without the Twitter API, in a way that scales for Big Data.
Add sentiment directly to a Spark DataFrame
Transforming this code to Spark is simple, and the same pattern can help you convert other code as well. Let's start by importing the User Defined Function module from Spark:
```python
from pyspark.sql.functions import udf
```
Then we will transform the code from above to a function:
```python
from textblob import TextBlob

def apply_blob(sentence):
    temp = TextBlob(sentence).sentiment[0]
    if temp == 0.0:
        return 0.0  # Neutral
    elif temp > 0.0:
        return 1.0  # Positive
    else:
        return 2.0  # Negative
```
After that we will register the function as a Spark UDF:
```python
sentiment = udf(apply_blob)
```
Then to apply the function to the whole dataframe we need to write:
```python
clean_tweets.withColumn("sentiment", sentiment(clean_tweets['tweet'])).show()
```
And we will see the dataframe with the new sentiment column.
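To sanity-check the label mapping without Spark or TextBlob installed, here is a pure-Python sketch that mirrors apply_blob's branching; the polarity values are hard-coded stand-ins for TextBlob's scores:

```python
def polarity_to_label(polarity):
    # Same branching as apply_blob, applied to a raw polarity score
    if polarity == 0.0:
        return 0.0  # Neutral
    elif polarity > 0.0:
        return 1.0  # Positive
    else:
        return 2.0  # Negative

# Stand-in polarity scores: neutral, positive, negative
print([polarity_to_label(p) for p in [0.0, 0.8, -0.5]])  # [0.0, 1.0, 2.0]
```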
Sentiment analysis, the good programmer way (Making the code modular)
This is not exactly quality code yet. Let's turn it into functions we can reuse over and over.
The first part is setting up everything:
```python
%load_ext autoreload
%autoreload 2

# Import twint
import sys
sys.path.append("twint/")

# Set up TWINT config
import twint
c = twint.Config()

# Other imports
import seaborn as sns
import os
from optimus import Optimus
op = Optimus()

# Solve compatibility issues with notebooks and RunTime errors
import nest_asyncio
nest_asyncio.apply()

# Disable annoying printing
class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
```
The last part is a context manager that suppresses Twint's automatic printing, so we only see the dataframe.
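In isolation, the pattern works like this: while the with block is active, stdout is redirected to os.devnull, and on exit the original stream is restored:

```python
import os
import sys

class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout

original = sys.stdout
with HiddenPrints():
    print("this line is swallowed")
assert sys.stdout is original  # stdout is restored on exit
print("back to normal")
```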
Everything above can be summarized in these functions:
```python
from textblob import TextBlob
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Function to get sentiment
def apply_blob(sentence):
    temp = TextBlob(sentence).sentiment[0]
    if temp == 0.0:
        return 0.0  # Neutral
    elif temp > 0.0:
        return 1.0  # Positive
    else:
        return 2.0  # Negative

# UDF to write sentiment on the DF
sentiment = udf(apply_blob, DoubleType())

# Transform result to pandas
def twint_to_pandas(columns):
    return twint.output.panda.Tweets_df[columns]

def tweets_sentiment(search, limit=1):
    c.Search = search
    # Custom output format
    c.Format = "Username: {username} | Tweet: {tweet}"
    c.Limit = limit
    c.Pandas = True
    with HiddenPrints():
        print(twint.run.Search(c))

    # Transform tweets to pandas DF
    df_pd = twint_to_pandas(["date", "username", "tweet", "hashtags", "nlikes"])

    # Transform pandas DF to Optimus/Spark DF
    df = op.create.data_frame(pdf=df_pd)

    # Clean tweets
    clean_tweets = df.cols.remove_accents("tweet") \
                     .cols.remove_special_chars("tweet")

    # Add sentiment to the final DF
    return clean_tweets.withColumn("sentiment", sentiment(clean_tweets['tweet']))
```
So to get the tweets and add sentiment we use:
```python
df_result = tweets_sentiment("data science", limit=1)
df_result.show()
```
And that’s it :)
Let's see the distribution of the sentiments:
```python
df_res_pandas = df_result.toPandas()
sns.distplot(df_res_pandas['sentiment'])
sns.set(rc={'figure.figsize': (11.7, 8.27)})
```
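If you just want the counts per label rather than a plot, pandas' value_counts works directly on the sentiment column. A small sketch with sample labels standing in for the real results (which will vary per run):

```python
import pandas as pd

# Sample sentiment labels standing in for df_res_pandas['sentiment']
sentiments = pd.Series([1.0, 0.0, 1.0, 2.0, 1.0], name="sentiment")

# Count how many tweets fall in each class
print(sentiments.value_counts()[1.0])  # 3 positive tweets in this sample
```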
Doing more with Twint
We can do more stuff, here I’ll show you how to create a simple function to get tweets, and also how to build a word cloud from them.
So to get the tweets from a simple search:
```python
def get_tweets(search, limit=100):
    c = twint.Config()
    c.Search = search
    c.Limit = limit
    c.Pandas = True
    c.Pandas_clean = True

    with HiddenPrints():
        print(twint.run.Search(c))
    return twint.output.panda.Tweets_df[["username", "tweet"]]
```
With this we can get thousands of tweets very easily:
```python
tweets = get_tweets("data science", limit=10000)
tweets.count()  # 10003
```
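One caveat: on a pandas DataFrame, .count() returns the non-null count per column rather than a single row count; len() gives the number of rows. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "username": ["alice", "bob", "carol"],
    "tweet": ["hello world", None, "data science"],
})

print(len(df))              # 3 rows
print(df["tweet"].count())  # 2 non-null tweets
```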
To generate a word cloud this is all we need to do:
```python
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline

text = tweets.tweet.values

# Add tweet-specific stopwords
stopwords = set(STOPWORDS)
stopwords.add("https")
stopwords.add("xa0")
stopwords.add("xa0'")
stopwords.add("bitly")
stopwords.add("bit")
stopwords.add("ly")
stopwords.add("twitter")
stopwords.add("pic")

wordcloud = WordCloud(
    background_color='black',
    width=1000,
    height=500,
    stopwords=stopwords).generate(str(text))
```
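The effect of the stopword set is roughly this token filter (a simplified sketch that ignores WordCloud's real tokenization and case handling):

```python
# A subset of the stopwords added above
stopwords = {"https", "bitly", "bit", "ly", "twitter", "pic"}

tweet = "great read on data science https bit ly abc123"

# Keep only words that are not in the stopword set
kept = [word for word in tweet.split() if word not in stopwords]
print(kept)  # ['great', 'read', 'on', 'data', 'science', 'abc123']
```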
I added some stopwords that are common in tweets but don't matter to the analysis. To show the cloud we use:
```python
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.rcParams['figure.figsize'] = [10, 10]
```
And you’ll get:
Pretty, but not great code. If we want good code we need modules, so let's turn that into a function:
```python
def generate_word_cloud(tweets):
    # Get the text out of the tweets
    text = tweets.tweet.values

    # Add tweet-specific stopwords
    stopwords = set(STOPWORDS)
    stopwords.add("https")
    stopwords.add("xa0")
    stopwords.add("xa0'")
    stopwords.add("bitly")
    stopwords.add("bit")
    stopwords.add("ly")
    stopwords.add("twitter")
    stopwords.add("pic")

    wordcloud = WordCloud(
        background_color='black',
        width=1000,
        height=500,
        stopwords=stopwords).generate(str(text))

    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.rcParams['figure.figsize'] = [10, 10]
```
And then we just run:
```python
tweets = get_tweets("artificial intelligence", limit=1000)
generate_word_cloud(tweets)
```
Try it yourself
There’s much more things that you can do with the library. Some other functions:
- twint.run.Search() - Fetch Tweets using the search filters (Normal);
- twint.run.Followers() - Fetch a Twitter user's followers;
- twint.run.Following() - Fetch who a Twitter user follows;
- twint.run.Favorites() - Fetch Tweets a Twitter user has liked;
- twint.run.Profile() - Fetch Tweets from a user's profile (includes retweets);
- twint.run.Lookup() - Fetch information from a user's profile (bio, location, etc.).
You can also use it from the terminal. For that, just run:
```
pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
```
Then go to the twint folder:
```
cd src/twint
```
And finally you can run for example:
```
twint -u TDataScience --since 2019-01-01 --o TDS.csv --csv
```
Here I’m getting all the tweets (845 so far) from the TDS Team of the year. Here is the CSV file if you want it:
Bonus (scaling the results)
Let’s get 10k tweets and get their sentiment, because why not. For that:
```python
df_result = tweets_sentiment("data science", limit=100000)
df_result.show()
```
This actually took almost 10 minutes, so plan accordingly. It may be faster to get the tweets from the CLI and then apply the function. Let's see how many tweets we have:
```python
df_result.count()
```
And we have 10,031 tweets with sentiment! You can use them to train other models, too.
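As a quick sketch of using these labels downstream, you could split the labeled tweets into train and test sets once they are back in pandas. Toy data stands in for df_result.toPandas() here, since the real dataframe needs a live Spark session:

```python
import pandas as pd

# Toy stand-in for df_result.toPandas()
df = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(100)],
    "sentiment": [float(i % 3) for i in range(100)],
})

# 80/20 split via random sampling; the remainder becomes the test set
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(len(train), len(test))  # 80 20
```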
Thanks for reading this; hopefully it can help you with your current job and your understanding of data science. If you want to know more about me, follow me on Twitter:
Favio Vázquez (@FavioVaz) | Twitter
Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. He is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team and having interesting problems to solve. He is part of Apache Spark collaboration, helping in MLlib, Core and the Documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and automatic learning to help the world become a better place.
Original. Reposted with permission.
Related:
- Data Science with Optimus Part 2: Setting your DataOps Environment
- Data Science with Optimus Part 1: Intro
- Optimus v2: Agile Data Science Workflows Made Easy