Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint

Social media has been a gold mine for studying the way people communicate and behave. In this article I’ll show you the easiest way of analyzing tweets without the Twitter API, in a way that scales to Big Data.



Add sentiment directly to a Spark DataFrame

 

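Before porting anything to Spark, here is the plain-Python starting point: a minimal TextBlob sentiment snippet (a sketch of the kind of code we want to move to Spark; the example sentence is my own):

from textblob import TextBlob

# sentiment is a (polarity, subjectivity) tuple; polarity is the first element
sentence = "Optimus makes data cleaning easy"  # example sentence
polarity = TextBlob(sentence).sentiment[0]
print(polarity)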

Transforming this code to Spark is simple, and the same pattern will help you port other snippets as well. So let’s start by importing the User Defined Function (UDF) module from Spark:

from pyspark.sql.functions import udf


Then we will transform the code from above to a function:

def apply_blob(sentence):
    temp = TextBlob(sentence).sentiment[0]
    if temp == 0.0:
        return 0.0 # Neutral
    elif temp > 0.0:
        return 1.0 # Positive
    else:
        return 2.0 # Negative
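A quick sanity check on plain strings before wiring it into Spark (the sentences are just examples; exact scores depend on TextBlob’s lexicon):

# Example sentences of my own; TextBlob should label them positive and negative
print(apply_blob("I love data science"))      # expected: 1.0 (positive)
print(apply_blob("This is a terrible idea"))  # expected: 2.0 (negative)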


After that we will register the function as a Spark UDF, declaring that it returns a double:

from pyspark.sql.types import DoubleType

sentiment = udf(apply_blob, DoubleType())


Then, to apply the function to the whole DataFrame, we write:

clean_tweets.withColumn("sentiment", sentiment(clean_tweets['tweet'])).show()


And we will see the new sentiment column added to the DataFrame.

 

Sentiment analysis, the good programmer way (Making the code modular)

 

The code above works, but it’s not exactly quality code. Let’s transform it into functions we can use over and over.

The first part is setting up everything:

%load_ext autoreload
%autoreload 2

# Import twint
import sys
sys.path.append("twint/")

# Set up TWINT config
import twint
c = twint.Config()

# Other imports
import seaborn as sns
import os
from optimus import Optimus
op = Optimus()

# Solve compatibility issues between notebooks and Twint's event loop (avoids RuntimeErrors)
import nest_asyncio
nest_asyncio.apply()

# Disable annoying printing

class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout


The last part is a class that suppresses Twint’s automatic printing, so we only see the DataFrame.

All of the above can be summarized in these functions:

from textblob import TextBlob
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Function to get sentiment
def apply_blob(sentence):
    temp = TextBlob(sentence).sentiment[0]
    if temp == 0.0:
        return 0.0 # Neutral
    elif temp > 0.0:
        return 1.0 # Positive
    else:
        return 2.0 # Negative

# UDF to write sentiment on DF
sentiment = udf(apply_blob, DoubleType())

# Transform result to pandas
def twint_to_pandas(columns):
    return twint.output.panda.Tweets_df[columns]

def tweets_sentiment(search, limit=1):
    c.Search = search
    # Custom output format
    c.Format = "Username: {username} |  Tweet: {tweet}"
    c.Limit = limit
    c.Pandas = True
    with HiddenPrints():
        print(twint.run.Search(c))

    # Transform tweets to pandas DF
    df_pd = twint_to_pandas(["date", "username", "tweet", "hashtags", "nlikes"])

    # Transform Pandas DF to Optimus/Spark DF
    df = op.create.data_frame(pdf=df_pd)

    # Clean tweets
    clean_tweets = df.cols.remove_accents("tweet") \
                 .cols.remove_special_chars("tweet")

    # Add sentiment to final DF
    return clean_tweets.withColumn("sentiment", sentiment(clean_tweets['tweet']))


So to get the tweets and add sentiment we use:

df_result = tweets_sentiment("data science", limit=1)


df_result.show()



And that’s it :)

Let’s see the distribution of the sentiments:

df_res_pandas = df_result.toPandas()
sns.set(rc={'figure.figsize':(11.7,8.27)})  # set figure size before plotting
sns.distplot(df_res_pandas['sentiment'])
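Because the labels are categorical (0.0, 1.0, 2.0), a quick sanity check is to count tweets per class directly in Spark before converting to pandas (a small sketch using the df_result from above):

# Count tweets per sentiment class (0.0 = neutral, 1.0 = positive, 2.0 = negative)
df_result.groupBy("sentiment").count().show()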



 

Doing more with Twint

 

We can do more. Here I’ll show you how to create a simple function to get tweets, and how to build a word cloud from them.

So to get the tweets from a simple search:

def get_tweets(search, limit=100):
    c = twint.Config()
    c.Search = search
    c.Limit = limit
    c.Pandas = True
    c.Pandas_clean = True

    with HiddenPrints():
        print(twint.run.Search(c))
    return twint.output.panda.Tweets_df[["username","tweet"]]


With this we can get thousands of tweets very easily:

tweets = get_tweets("data science", limit=10000)

len(tweets) # 10003


To generate a word cloud, this is all we need to do:

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline

text = tweets.tweet.values

# adding tweet-specific stopwords
stopwords = set(STOPWORDS)
stopwords.add("https")
stopwords.add("xa0")
stopwords.add("xa0'")
stopwords.add("bitly")
stopwords.add("bit")
stopwords.add("ly")
stopwords.add("twitter")
stopwords.add("pic")

wordcloud = WordCloud(
    background_color = 'black',
    width = 1000,
    height = 500,
    stopwords = stopwords).generate(" ".join(text))  # join the tweets into one string


I added some stopwords that are common in tweets but don’t matter for the analysis. To show the cloud we use:

plt.rcParams['figure.figsize'] = [10, 10]  # set figure size before plotting
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")


And you’ll get:


Pretty, but not that pretty. If we want good code we need modules, so let’s turn this into a function:

def generate_word_cloud(tweets):

    # Getting the text out of the tweets
    text = tweets.tweet.values

    # adding tweet-specific stopwords
    stopwords = set(STOPWORDS)
    stopwords.add("https")
    stopwords.add("xa0")
    stopwords.add("xa0'")
    stopwords.add("bitly")
    stopwords.add("bit")
    stopwords.add("ly")
    stopwords.add("twitter")
    stopwords.add("pic")

    wordcloud = WordCloud(
        background_color = 'black',
        width = 1000,
        height = 500,
        stopwords = stopwords).generate(" ".join(text))  # join the tweets into one string

    plt.rcParams['figure.figsize'] = [10, 10]  # set figure size before plotting
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")


And then we just run:

tweets = get_tweets("artificial intelligence", limit=1000)
generate_word_cloud(tweets)



 

Try it yourself

 


There are many more things you can do with the library. Some other functions (a usage sketch follows the list):

  • twint.run.Search() - Fetch Tweets using the search filters (Normal);
  • twint.run.Followers() - Fetch a Twitter user's followers;
  • twint.run.Following() - Fetch who follows a Twitter user;
  • twint.run.Favorites() - Fetch Tweets a Twitter user has liked;
  • twint.run.Profile() - Fetch Tweets from a user's profile (Includes retweets);
  • twint.run.Lookup() - Fetch information from a user's profile (bio, location, etc.).
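For example, fetching a user’s followers looks like this (a minimal sketch; the username and limit are just placeholders):

import twint

# Hypothetical example: fetch followers of a given account
c = twint.Config()
c.Username = "TDataScience"
c.Limit = 100
twint.run.Followers(c)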

You can also use Twint from the terminal. For that, just run:

pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint


Then go to the twint folder:

cd src/twint


And finally you can run for example:

twint -u TDataScience --since 2019-01-01 --o TDS.csv --csv
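If you’d rather stay in a notebook, the same query can be expressed with Twint’s Python API (a sketch mirroring the CLI flags above):

import twint

# Mirror of the CLI call: tweets from @TDataScience since 2019-01-01, stored as CSV
c = twint.Config()
c.Username = "TDataScience"
c.Since = "2019-01-01"
c.Store_csv = True
c.Output = "TDS.csv"
twint.run.Search(c)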


Here I’m getting all the tweets from the TDS Team this year (845 so far). Here is the CSV file if you want it:

FavioVazquez/twitter_optimus_twint: Analyzing tweets with Twint, Optimus and Apache Spark (github.com)

 

Bonus (scaling the results)

 

figure-name

Let’s fetch 10k tweets and compute their sentiment, because why not. For that:

df_result = tweets_sentiment("data science", limit=100000)

df_result.show()


This actually took almost 10 minutes, so plan accordingly. It may be faster to get the tweets from the CLI and then just apply the function (a sketch of that route closes this section). Let’s see how many tweets we have:

df_result.count()


And we have 10,031 tweets with sentiments! You can use them to train other models, too.
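If the all-in-one function is too slow, here is a sketch of the CLI-then-Spark route mentioned above, assuming TDS.csv was produced by the twint command earlier and that op and sentiment are defined as before:

# Load the CSV scraped from the CLI into an Optimus/Spark DataFrame
df = op.load.csv("TDS.csv")

# Apply the same sentiment UDF to the tweet column
df.withColumn("sentiment", sentiment(df["tweet"])).show()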

Thanks for reading this; hopefully it can help you in your current job and with your understanding of data science. If you want to know more about me, follow me on Twitter:

Favio Vázquez (@FavioVaz) | Twitter

 
Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. He is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team and having interesting problems to solve. He is a contributor to Apache Spark, helping with MLlib, Core and the documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and machine learning to help the world become a better place.

Original. Reposted with permission.
