Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint

Social media has been gold for studying the way people communicate and behave, in this article I’ll show you the easiest way of analyzing tweets without the Twitter API and scalable for Big Data.

By Favio Vazquez, Founder at Ciencia y Datos on May 24, 2019 in Apache Spark, Big Data, Deep Learning, Machine Learning, NLP, Optimus, Python, Twint

comments

Introduction

If you are here it’s likely that you are interested in analyzing tweets (or something similar) and you have a lot of them, or can get them. One of the most annoying things for that is getting a Twitter application, get the authentication and all of that. And then if you are using Pandas, there’s no way to scale that.

So what about a system that doesn’t have to authenticate with the Twitter API, that can get an unlimited (well almost) amount of tweets and the power to analyze them, with NLP and more. Well you’re in for a treat because that’s exactly what I’m going to show you right now.

Getting the project and repo

https://matrixds.com/

You can follow everything I’m going to show you very easily. Just forklift this MatrixDS project:

MatrixDS | The Data Project Workbench
MatrixDS is a place to build, share and manage data projects at any scale.community.platform.matrixds.com

Also there’s a GitHub repo with everything:

FavioVazquez/twitter_optimus_twint
Analyzing tweets with Twint, Optimus and Apache Spark. - FavioVazquez/twitter_optimus_twintgithub.com

With MatrixDS you can actually run the notebooks, get the data and run the analysis for free, so if you want to learn more please do it.

Getting Twint and Optimus

Twint utilizes Twitter’s search operators to let you scrape Tweets from specific users, scrape Tweets relating to certain topics, hashtags & trends, or sort out sensitive information from Tweets like e-mail and phone numbers.

With Optimus, a library I co-created, you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, Sparkling Water and Keras.

So let’s first install everything you need, for that when you are in the Matrix project, go to the Analyze Tweets notebook and run (you can also do this from the JupyterLab terminal):

!pip install --user -r requirements.txt

After that, we need to install Twint, for that run:

!pip install --upgrade --user -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint

This will download a scr/ folder so we need to do some config:

!mv src/twint .
!rm -r src

Then to import Twint that we need to run:

%load_ext autoreload
%autoreload 2

import sys
sys.path.append("twint/")

and finally:

import twint

Optimus was installed in the first step, so let’s just start it (this will start a Spark cluster for you):

from optimus import Optimus
op = Optimus()

Setup Twint for scrapping tweets

https://www.theverge.com/2015/11/3/9661180/twitter-vine-favorite-fav-likes-hearts

# Set up TWINT config
c = twint.Config()

If you are running this on notebooks you’ll need also to run:

# Solve compatibility issues with notebooks and RunTime errors.
import nest_asyncio
nest_asyncio.apply()

Search for data science tweets

I’ll start our analysis scrapping tweets about data science, you can change this to whatever you want.

For doing that we just need to run this:

c.Search = "data science"
# Custom output format
c.Format = "Username: {username} |  Tweet: {tweet}"
c.Limit = 1
c.Pandas = True

twint.run.Search(c)

Let me explain this code to you. In the last section when we ran the code:

c = twint.Config()

we started a new Twint configuration. After that we need to pass different options we want to scrape tweets. Here’s the full list of configuring options:

Variable             Type       Description
--------------------------------------------
Username             (string) - Twitter user's username
User_id              (string) - Twitter user's user_id
Search               (string) - Search terms
Geo                  (string) - Geo coordinates (lat,lon,km/mi.)
Location             (bool)   - Set to True to attempt to grab a Twitter user's location (slow).
Near                 (string) - Near a certain City (Example: london)
Lang                 (string) - Compatible language codes: https://github.com/twintproject/twint/wiki/Langauge-codes
Output               (string) - Name of the output file.
Elasticsearch        (string) - Elasticsearch instance
Timedelta            (int)    - Time interval for every request (days)
Year                 (string) - Filter Tweets before the specified year.
Since                (string) - Filter Tweets sent since date (Example: 2017-12-27).
Until                (string) - Filter Tweets sent until date (Example: 2017-12-27).
Email                (bool)   - Set to True to show Tweets that _might_ contain emails.
Phone                (bool)   - Set to True to show Tweets that _might_ contain phone numbers.
Verified             (bool)   - Set to True to only show Tweets by _verified_ users
Store_csv            (bool)   - Set to True to write as a csv file.
Store_json           (bool)   - Set to True to write as a json file.
Custom               (dict)   - Custom csv/json formatting (see below).
Show_hashtags        (bool)   - Set to True to show hashtags in the terminal output.
Limit                (int)    - Number of Tweets to pull (Increments of 20).
Count                (bool)   - Count the total number of Tweets fetched.
Stats                (bool)   - Set to True to show Tweet stats in the terminal output.
Database             (string) - Store Tweets in a sqlite3 database. Set this to the DB. (Example: twitter.db)
To                   (string) - Display Tweets tweeted _to_ the specified user.
All                  (string) - Display all Tweets associated with the mentioned user.
Debug                (bool)   - Store information in debug logs.
Format               (string) - Custom terminal output formatting.
Essid                (string) - Elasticsearch session ID.
User_full            (bool)   - Set to True to display full user information. By default, only usernames are shown.
Profile_full         (bool)   - Set to True to use a slow, but effective method to enumerate a user's Timeline.
Store_object         (bool)   - Store tweets/user infos/usernames in JSON objects.
Store_pandas         (bool)   - Save Tweets in a DataFrame (Pandas) file.
Pandas_type          (string) - Specify HDF5 or Pickle (HDF5 as default).
Pandas               (bool)   - Enable Pandas integration.
Index_tweets         (string) - Custom Elasticsearch Index name for Tweets (default: twinttweets).
Index_follow         (string) - Custom Elasticsearch Index name for Follows (default: twintgraph).
Index_users          (string) - Custom Elasticsearch Index name for Users (default: twintuser).
Index_type           (string) - Custom Elasticsearch Document type (default: items).
Retries_count        (int)    - Number of retries of requests (default: 10).
Resume               (int)    - Resume from a specific tweet id (**currently broken, January 11, 2019**).
Images               (bool)   - Display only Tweets with images.
Videos               (bool)   - Display only Tweets with videos.
Media                (bool)   - Display Tweets with only images or videos.
Replies              (bool)   - Display replies to a subject.
Pandas_clean         (bool)   - Automatically clean Pandas dataframe at every scrape.
Lowercase            (bool)   - Automatically convert uppercases in lowercases.
Pandas_au            (bool)   - Automatically update the Pandas dataframe at every scrape.
Proxy_host           (string) - Proxy hostname or IP.
Proxy_port           (int)    - Proxy port.
Proxy_type           (string) - Proxy type.
Tor_control_port     (int) - Tor control port.
Tor_control_password (string) - Tor control password (not hashed).
Retweets             (bool)   - Display replies to a subject.
Hide_output          (bool)   - Hide output.
Get_replies          (bool)   - All replies to the tweet.

So in this code:

c.Search = "data science"
# Custom output format
c.Format = "Username: {username} |  Tweet: {tweet}"
c.Limit = 1
c.Pandas = True

We are setting the search term, them formatting the response (just to check), getting only 20 tweets with the Limit =1 (they are in increments of 20) and finally making the result compatible with Pandas.
Then when we run:

twint.run.Search(c)

We are launching the search. The result is:

Username: tmj_phl_pharm |  Tweet: If you're looking for work in Spring House, PA, check out this Biotech/Clinical/R&D/Science job via the link in our bio: KellyOCG Exclusive: Data Access Analyst in Spring House, PA- Direct Hire at Kelly Services #KellyJobs #KellyServices
Username: DataSci_Plow |  Tweet: Bring your Jupyter Notebook to life with interactive widgets  https://www.plow.io/post/bring-your-jupyter-notebook-to-life-with-interactive-widgets?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: ottofwagner |  Tweet: Top 7 R Packages for Data Science and AI   https://noeliagorod.com/2019/03/07/top-7-r-packages-for-data-science-and-ai/ … #DataScience #rstats #MachineLearning
Username: semigoose1 |  Tweet: ëäSujy #crypto #bitcoin #java #competition #influencer #datascience #fintech #science #EU  https://vk.com/id15800296  https://semigreeth.wordpress.com/2019/05/03/easujy-crypto-bitcoin-java-competition-influencer-datascience-fintech-science-eu- https-vk-com-id15800296/ …
Username: Datascience__ |  Tweet: Introduction to Data Analytics for Business  http://zpy.io/c736cf9f  #datascience #ad
Username: Datascience__ |  Tweet: How Entrepreneurs in Emerging Markets can master the Blockchain Technology  http://zpy.io/f5fad501  #datascience #ad
Username: viktor_spas |  Tweet: [Перевод] Почему Data Science командам нужны универсалы, а не специалисты  https://habr.com/ru/post/450420/?utm_source=dlvr.it&utm_medium=twitter&utm_campaign=450420 … pic.twitter.com/i98frTwPSE
Username: gp_pulipaka |  Tweet: Orchestra is a #RPA for Orchestrating Project Teams. #BigData #Analytics #DataScience #AI #MachineLearning #Robotics #IoT #IIoT #PyTorch #Python #RStats #TensorFlow #JavaScript #ReactJS #GoLang #CloudComputing #Serverless #DataScientist #Linux @lruettimann  http://bit.ly/2Hn6qYd  pic.twitter.com/kXizChP59U
Username: amruthasuri |  Tweet: "Here's a typical example of a day in the life of a RagingFX trader. Yesterday I received these two signals at 10am EST. Here's what I did... My other activities have kept me so busy that ...  http://bit.ly/2Jm9WT1  #Learning #DataScience #bigdata #Fintech pic.twitter.com/Jbes6ro1lY
Username: PapersTrending |  Tweet: [1/10] Real numbers, data science and chaos: How to fit any dataset with a single parameter - 192 stars - pdf:  https://arxiv.org/pdf/1904.12320v1.pdf … - github: https://github.com/Ranlot/single-parameter-fit …
Username: webAnalyste |  Tweet: Building Data Science Capabilities Means Playing the Long Game  http://dlvr.it/R41k3t  pic.twitter.com/Et5CskR2h4
Username: DataSci_Plow |  Tweet: Building Data Science Capabilities Means Playing the Long Game  https://www.plow.io/post/building-data-science-capabilities-means-playing-the-long-game?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: webAnalyste |  Tweet: Towards Well Being, with Data Science (part 2)  http://dlvr.it/R41k1K  pic.twitter.com/4VbljUcsLh
Username: DataSci_Plow |  Tweet: Understanding when Simple and Multiple Linear Regression give Different Results  https://www.plow.io/post/understanding-when-simple-and-multiple-linear-regression-give-different-results?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: DataSci_Plow |  Tweet: Artificial Curiosity  https://www.plow.io/post/artificial-curiosity?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: gp_pulipaka |  Tweet: Synchronizing the Digital #SCM using AI for Supply Chain Planning. #BigData #Analytics #DataScience #AI #RPA #MachineLearning #IoT #IIoT #Python #RStats #TensorFlow #JavaScript #ReactJS #GoLang #CloudComputing #Serverless #DataScientist #Linux @lruettimann  http://bit.ly/2KX8vrt  pic.twitter.com/tftxwilkQf
Username: DataSci_Plow |  Tweet: Extreme Rare Event Classification using Autoencoders in Keras  https://www.plow.io/post/extreme-rare-event-classification-using-autoencoders-in-keras?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: DataSci_Plow |  Tweet: Five Methods to Debug your Neural Network  https://www.plow.io/post/five-methods-to-debug-your-neural-network?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: iamjony94 |  Tweet: 26 Mobile and Desktop Tools for Marketers  http://bit.ly/2LkL3cN  #socialmedia #digitalmarketing #contentmarketing #growthhacking #startup #SEO #ecommerce #marketing #influencermarketing #blogging #infographic #deeplearning #ai #machinelearning #bigdata #datascience #fintech pic.twitter.com/mxHiY4eNXR
Username: TDWI |  Tweet: #ATL #DataPros: Our #analyst, @prussom is headed your way to speak @ the #FDSRoadTour on Wed, 5/8! Register to attend for free, learn about Modern #DataManagement in the Age of #Cloud & #DataScience: Trends, Challenges & Opportunities.  https://bit.ly/2WlYOJb  #Atlanta #freeevent

Doesn’t look that good but we got what we wanted. TWEETS!

Saving results into Pandas

Sadly there’s no direct connection between Twint and Spark, but we can do it with Pandas and then pass the result to Optimus.

I created to simple functions that you can see in the actual project that helps you with Pandas and the weird Twint API for this part. So when we run this:

available_columns()

You’ll see:

Index(['conversation_id', 'created_at', 'date', 'day', 'hashtags', 'hour','id', 'link', 'location', 'name', 'near', 'nlikes', 'nreplies','nretweets', 'place', 'profile_image_url', 'quote_url', 'retweet','search', 'timezone', 'tweet', 'user_id', 'user_id_str', 'username'],dtype='object')

These are the columns we have from the query we just did. There’s a lot of different things to do with this data, but for this article I’ll only use some of them. So to transform the result from Twint to Pandas we run:

df_pd = twint_to_pandas(["date", "username", "tweet", "hashtags", "nlikes"])

and you’ll see this Pandas DF:

Much better isn’t it?

Sentiment Analysis (the simple way)

We will run a sentiment analysis on some tweets, using Optimus and TextBlob a library for NLP. The first thing we need to do is clean this tweets, for that Optimus is the best choice.
For saving the data as an Optimus (Spark) DF we need to run:

df = op.create.data_frame(pdf= df_pd)

We’ll just remove accents and special characters with Optimus (for a real work scenario you need to do much more than this like removing links, images, and stopwords), for that:

clean_tweets = df.cols.remove_accents("tweet") \
                 .cols.remove_special_chars("tweet")

Then we need to collect this tweets from Spark to get them in a Python list, for that:

tweets = clean_tweets.select("tweet").rdd.flatMap(lambda x: x).collect()

Then to analyze the sentiment of these tweets we will use TextBlob sentiment function:

from textblob import TextBlob
from IPython.display import Markdown, display

# Pretty printing the result
def printmd(string, color=None):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))

for tweet in tweets:
    print(tweet)
    analysis = TextBlob(tweet)
    print(analysis.sentiment)
    if analysis.sentiment[0]>0:
        printmd('Positive', color="green")
    elif analysis.sentiment[0]<0:
        printmd('Negative', color="red")
    else:
        printmd("No result", color="grey")
        print("")

That will give us:

IAM Platform Curated Retweet  Via  httpstwittercomarmaninspace  ArtificialIntelligence AI What About The User Experience  httpswwwforbescomsitestomtaulli20190427artificialintelligenceaiwhatabouttheuserexperience  AI DataScience MachineLearning BigData DeepLearning Robots IoT ML DL IAMPlatform TopInfluence ArtificialIntelligence
Sentiment(polarity=0.0, subjectivity=0.0)

Neutral

Seattle Data Science Career Advice Landing a Job in The Emerald City Tips from Metis Seattle Career Advisor Marybeth Redmond –  httpsbitly2IYjzaj  pictwittercom98hMYZVxsu
Sentiment(polarity=0.0, subjectivity=0.0)

Neutral

This webinarworkshop is designed for business leaders data science managers and decision makers who want to build effective AI and data science capabilities for their organization Register here  httpsbitly2GDQeQT  pictwittercomxENQ0Dtv1X
Sentiment(polarity=0.6, subjectivity=0.8)

Positive

Contoh yang menarik dari sport science kali ini dari sisi statistik dan pemetaan lapangan Dengan makin gencarnya scientific method masuk di sport maka pengolahan data seperti ini akan semakin menjadi hal biasa  httpslnkdinfQHqgjh
Sentiment(polarity=0.0, subjectivity=0.0)

Neutral

Complete handson machine learning tutorial with data science Tensorflow artificial intelligence and neural networks  Machine Learning Data Science and Deep Learning with Python   httpsmedia4yousocialcareerdevelopmenthtmlmachinelearning  python machine learning online data science udemy elearning pictwittercomqgGVzRUFAM
Sentiment(polarity=-0.16666666666666666, subjectivity=0.6)

Negative

We share criminal data bases have science and medical collaoarations Freedom of movement means we can live and work in EU countries with no hassle at all much easier if youre from a poorer background We have Erasmus loads more good things
Sentiment(polarity=0.18939393939393936, subjectivity=0.39166666666666666)

Positive

Value of Manufacturers Shipments for Durable Goods BigData DataScience housing rstats ggplot pictwittercomXy0UIQtNHy
Sentiment(polarity=0.0, subjectivity=0.0)

Neutral

Top DataScience and MachineLearning Methods Used in 2018 2019 AI MoRebaie TMounaged AINow6 JulezNorton  httpswwwkdnuggetscom201904topdatasciencemachinelearningmethods20182019html
Sentiment(polarity=0.5, subjectivity=0.5)

Positive

Come check out the Santa Monica Data Science  Artificial Intelligence meetup to learn about In PersonComplete Handson Machine Learning Tutorial with Data Science  httpbitly2IRh0GU
Sentiment(polarity=-0.6, subjectivity=1.0)

Negative

Great talks about the future of multimodality clinical translation and data science Very inspiring 1stPETMRIsymposium unitue PETMRI molecularimaging AI pictwittercomO542P9PKXF
Sentiment(polarity=0.4833333333333334, subjectivity=0.625)

Positive

Did engineering now into data science last 5 years and doing MSC in data science this year
Sentiment(polarity=0.0, subjectivity=0.06666666666666667)

Neutral

Program Officer – Data Science  httpbitly2PV3ROF
Sentiment(polarity=0.0, subjectivity=0.0)

Neutral.

And so on.

Well that was extremely easy, but it won’t scale, because in the end we are collecting the data from Spark so the driver’s RAM is the limit. Let’s do it a little better.

Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint

Introduction

Getting the project and repo

Getting Twint and Optimus

Setup Twint for scrapping tweets

Search for data science tweets

Saving results into Pandas

Sentiment Analysis (the simple way)

More On This Topic

Latest Posts

Top Posts