Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint
Social media has been a gold mine for studying the way people communicate and behave. In this article I’ll show you the easiest way to analyze tweets without the Twitter API, in a way that scales to Big Data.
Introduction
If you are here, it’s likely that you are interested in analyzing tweets (or something similar) and you have a lot of them, or can get them. One of the most annoying parts of that is registering a Twitter application, setting up the authentication and all of that. And if you are using Pandas alone, everything runs on a single machine, so the analysis is hard to scale.
So what about a system that doesn’t have to authenticate with the Twitter API, that can get an (almost) unlimited number of tweets, and that has the power to analyze them with NLP and more? Well, you’re in for a treat, because that’s exactly what I’m going to show you right now.
Getting the project and repo
You can follow everything I’m going to show you very easily. Just forklift this MatrixDS project:
Also there’s a GitHub repo with everything:
With MatrixDS you can actually run the notebooks, get the data and run the analysis for free, so if you want to learn more, please give it a try.
Getting Twint and Optimus
Twint utilizes Twitter’s search operators to let you scrape Tweets from specific users, scrape Tweets relating to certain topics, hashtags & trends, or sort out sensitive information from Tweets like e-mail and phone numbers.
With Optimus, a library I co-created, you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, Sparkling Water and Keras.
So let’s first install everything you need. Once you are in the MatrixDS project, go to the Analyze Tweets notebook and run this (you can also do it from the JupyterLab terminal):
!pip install --user -r requirements.txt
After that, we need to install Twint. To do that, run:
!pip install --upgrade --user -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
This will download a src/ folder, so we need to move a few things around:
!mv src/twint .
!rm -r src
Then, to import Twint, we need to run:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("twint/")
and finally:
import twint
Optimus was installed in the first step, so let’s just start it (this will start a Spark cluster for you):
from optimus import Optimus
op = Optimus()
Set up Twint for scraping tweets
# Set up TWINT config
c = twint.Config()
If you are running this in a notebook, you’ll also need to run:
# Solve compatibility issues with notebooks and RunTime errors.
import nest_asyncio
nest_asyncio.apply()
Search for data science tweets
I’ll start our analysis by scraping tweets about data science; you can change this to whatever you want.
To do that, we just need to run this:
c.Search = "data science"
# Custom output format
c.Format = "Username: {username} | Tweet: {tweet}"
c.Limit = 1
c.Pandas = True
twint.run.Search(c)
Let me explain this code to you. In the last section when we ran the code:
c = twint.Config()
we started a new Twint configuration. After that, we need to set the options we want to use for scraping tweets. Here’s the full list of configuration options:
Variable              Type       Description
--------------------------------------------
Username              (string) - Twitter user's username
User_id               (string) - Twitter user's user_id
Search                (string) - Search terms
Geo                   (string) - Geo coordinates (lat,lon,km/mi.)
Location              (bool)   - Set to True to attempt to grab a Twitter user's location (slow).
Near                  (string) - Near a certain City (Example: london)
Lang                  (string) - Compatible language codes: https://github.com/twintproject/twint/wiki/Langauge-codes
Output                (string) - Name of the output file.
Elasticsearch         (string) - Elasticsearch instance
Timedelta             (int)    - Time interval for every request (days)
Year                  (string) - Filter Tweets before the specified year.
Since                 (string) - Filter Tweets sent since date (Example: 2017-12-27).
Until                 (string) - Filter Tweets sent until date (Example: 2017-12-27).
Email                 (bool)   - Set to True to show Tweets that _might_ contain emails.
Phone                 (bool)   - Set to True to show Tweets that _might_ contain phone numbers.
Verified              (bool)   - Set to True to only show Tweets by _verified_ users
Store_csv             (bool)   - Set to True to write as a csv file.
Store_json            (bool)   - Set to True to write as a json file.
Custom                (dict)   - Custom csv/json formatting (see below).
Show_hashtags         (bool)   - Set to True to show hashtags in the terminal output.
Limit                 (int)    - Number of Tweets to pull (Increments of 20).
Count                 (bool)   - Count the total number of Tweets fetched.
Stats                 (bool)   - Set to True to show Tweet stats in the terminal output.
Database              (string) - Store Tweets in a sqlite3 database. Set this to the DB. (Example: twitter.db)
To                    (string) - Display Tweets tweeted _to_ the specified user.
All                   (string) - Display all Tweets associated with the mentioned user.
Debug                 (bool)   - Store information in debug logs.
Format                (string) - Custom terminal output formatting.
Essid                 (string) - Elasticsearch session ID.
User_full             (bool)   - Set to True to display full user information. By default, only usernames are shown.
Profile_full          (bool)   - Set to True to use a slow, but effective method to enumerate a user's Timeline.
Store_object          (bool)   - Store tweets/user infos/usernames in JSON objects.
Store_pandas          (bool)   - Save Tweets in a DataFrame (Pandas) file.
Pandas_type           (string) - Specify HDF5 or Pickle (HDF5 as default).
Pandas                (bool)   - Enable Pandas integration.
Index_tweets          (string) - Custom Elasticsearch Index name for Tweets (default: twinttweets).
Index_follow          (string) - Custom Elasticsearch Index name for Follows (default: twintgraph).
Index_users           (string) - Custom Elasticsearch Index name for Users (default: twintuser).
Index_type            (string) - Custom Elasticsearch Document type (default: items).
Retries_count         (int)    - Number of retries of requests (default: 10).
Resume                (int)    - Resume from a specific tweet id (**currently broken, January 11, 2019**).
Images                (bool)   - Display only Tweets with images.
Videos                (bool)   - Display only Tweets with videos.
Media                 (bool)   - Display Tweets with only images or videos.
Replies               (bool)   - Display replies to a subject.
Pandas_clean          (bool)   - Automatically clean Pandas dataframe at every scrape.
Lowercase             (bool)   - Automatically convert uppercases in lowercases.
Pandas_au             (bool)   - Automatically update the Pandas dataframe at every scrape.
Proxy_host            (string) - Proxy hostname or IP.
Proxy_port            (int)    - Proxy port.
Proxy_type            (string) - Proxy type.
Tor_control_port      (int)    - Tor control port.
Tor_control_password  (string) - Tor control password (not hashed).
Retweets              (bool)   - Display replies to a subject.
Hide_output           (bool)   - Hide output.
Get_replies           (bool)   - All replies to the tweet.
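With this many options, a typo in an attribute name is easy to make and hard to spot, because assigning to a misspelled attribute on a Python object does not raise an error by itself. Here is a minimal sketch of one way to guard against that. The helper `apply_options` and the `KNOWN_OPTIONS` dictionary are hypothetical names of mine, not part of Twint, and a `SimpleNamespace` stands in for `twint.Config()` so the snippet runs without Twint installed.

```python
from types import SimpleNamespace

# A handful of option names and types copied from the table above.
# This is only a subset, chosen for illustration.
KNOWN_OPTIONS = {
    "Search": str,
    "Lang": str,
    "Since": str,
    "Until": str,
    "Limit": int,
    "Verified": bool,
    "Pandas": bool,
}


def apply_options(config, options):
    """Set each option on config after checking its name and type."""
    for name, value in options.items():
        if name not in KNOWN_OPTIONS:
            raise KeyError(f"unknown option: {name}")
        expected = KNOWN_OPTIONS[name]
        # Exact type check, so that True is not accepted where an int is expected.
        if type(value) is not expected:
            raise TypeError(f"{name} expects {expected.__name__}")
        setattr(config, name, value)
    return config


# SimpleNamespace is a stand-in for twint.Config() (an assumption for this sketch).
c_demo = apply_options(SimpleNamespace(), {
    "Search": "data science",
    "Since": "2019-05-01",
    "Lang": "en",
    "Limit": 20,
    "Pandas": True,
})
print(c_demo.Search, c_demo.Since, c_demo.Limit)
```

In a notebook with Twint available, the same helper could presumably be given a real `twint.Config()` instead of the stand-in, since it only uses `setattr`.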
So in the search code we ran earlier:
c.Search = "data science"
# Custom output format
c.Format = "Username: {username} | Tweet: {tweet}"
c.Limit = 1
c.Pandas = True
We are setting the search term, then formatting the response (just to check), getting only 20 tweets with Limit = 1 (tweets come in increments of 20), and finally making the result compatible with Pandas.
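To make those two settings more concrete, here is a plain-Python illustration that does not call Twint at all. It assumes, based on the "Increments of 20" note in the option table, that the number of tweets returned is the limit rounded up to the next multiple of 20, and it fills the format template with `str.format`, which may differ from how Twint substitutes the placeholders internally. The function name `tweets_returned` is hypothetical.

```python
import math

PAGE_SIZE = 20  # assumption: tweets arrive in batches of 20


def tweets_returned(limit, page_size=PAGE_SIZE):
    """Round the requested limit up to the next full batch."""
    return math.ceil(limit / page_size) * page_size


# The same template we assigned to c.Format.
template = "Username: {username} | Tweet: {tweet}"
line = template.format(username="some_user", tweet="Hello data science")

print(tweets_returned(1))   # one batch
print(tweets_returned(45))  # three batches
print(line)
```

Under that assumption, Limit = 1 and Limit = 20 would both give you 20 tweets.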
Then when we run:
twint.run.Search(c)
We are launching the search. The result is:
Username: tmj_phl_pharm | Tweet: If you're looking for work in Spring House, PA, check out this Biotech/Clinical/R&D/Science job via the link in our bio: KellyOCG Exclusive: Data Access Analyst in Spring House, PA- Direct Hire at Kelly Services #KellyJobs #KellyServices
Username: DataSci_Plow | Tweet: Bring your Jupyter Notebook to life with interactive widgets https://www.plow.io/post/bring-your-jupyter-notebook-to-life-with-interactive-widgets?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: ottofwagner | Tweet: Top 7 R Packages for Data Science and AI https://noeliagorod.com/2019/03/07/top-7-r-packages-for-data-science-and-ai/ … #DataScience #rstats #MachineLearning
Username: semigoose1 | Tweet: ëäSujy #crypto #bitcoin #java #competition #influencer #datascience #fintech #science #EU https://vk.com/id15800296 https://semigreeth.wordpress.com/2019/05/03/easujy-crypto-bitcoin-java-competition-influencer-datascience-fintech-science-eu- https-vk-com-id15800296/ …
Username: Datascience__ | Tweet: Introduction to Data Analytics for Business http://zpy.io/c736cf9f #datascience #ad
Username: Datascience__ | Tweet: How Entrepreneurs in Emerging Markets can master the Blockchain Technology http://zpy.io/f5fad501 #datascience #ad
Username: viktor_spas | Tweet: [Перевод] Почему Data Science командам нужны универсалы, а не специалисты https://habr.com/ru/post/450420/?utm_source=dlvr.it&utm_medium=twitter&utm_campaign=450420 … pic.twitter.com/i98frTwPSE
Username: gp_pulipaka | Tweet: Orchestra is a #RPA for Orchestrating Project Teams. #BigData #Analytics #DataScience #AI #MachineLearning #Robotics #IoT #IIoT #PyTorch #Python #RStats #TensorFlow #JavaScript #ReactJS #GoLang #CloudComputing #Serverless #DataScientist #Linux @lruettimann http://bit.ly/2Hn6qYd pic.twitter.com/kXizChP59U
Username: amruthasuri | Tweet: "Here's a typical example of a day in the life of a RagingFX trader. Yesterday I received these two signals at 10am EST. Here's what I did... My other activities have kept me so busy that ... http://bit.ly/2Jm9WT1 #Learning #DataScience #bigdata #Fintech pic.twitter.com/Jbes6ro1lY
Username: PapersTrending | Tweet: [1/10] Real numbers, data science and chaos: How to fit any dataset with a single parameter - 192 stars - pdf: https://arxiv.org/pdf/1904.12320v1.pdf … - github: https://github.com/Ranlot/single-parameter-fit …
Username: webAnalyste | Tweet: Building Data Science Capabilities Means Playing the Long Game http://dlvr.it/R41k3t pic.twitter.com/Et5CskR2h4
Username: DataSci_Plow | Tweet: Building Data Science Capabilities Means Playing the Long Game https://www.plow.io/post/building-data-science-capabilities-means-playing-the-long-game?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: webAnalyste | Tweet: Towards Well Being, with Data Science (part 2) http://dlvr.it/R41k1K pic.twitter.com/4VbljUcsLh
Username: DataSci_Plow | Tweet: Understanding when Simple and Multiple Linear Regression give Different Results https://www.plow.io/post/understanding-when-simple-and-multiple-linear-regression-give-different-results?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: DataSci_Plow | Tweet: Artificial Curiosity https://www.plow.io/post/artificial-curiosity?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: gp_pulipaka | Tweet: Synchronizing the Digital #SCM using AI for Supply Chain Planning. #BigData #Analytics #DataScience #AI #RPA #MachineLearning #IoT #IIoT #Python #RStats #TensorFlow #JavaScript #ReactJS #GoLang #CloudComputing #Serverless #DataScientist #Linux @lruettimann http://bit.ly/2KX8vrt pic.twitter.com/tftxwilkQf
Username: DataSci_Plow | Tweet: Extreme Rare Event Classification using Autoencoders in Keras https://www.plow.io/post/extreme-rare-event-classification-using-autoencoders-in-keras?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: DataSci_Plow | Tweet: Five Methods to Debug your Neural Network https://www.plow.io/post/five-methods-to-debug-your-neural-network?utm_source=Twitter&utm_campaign=Data_science … +1 Hal2000Bot #data #science
Username: iamjony94 | Tweet: 26 Mobile and Desktop Tools for Marketers http://bit.ly/2LkL3cN #socialmedia #digitalmarketing #contentmarketing #growthhacking #startup #SEO #ecommerce #marketing #influencermarketing #blogging #infographic #deeplearning #ai #machinelearning #bigdata #datascience #fintech pic.twitter.com/mxHiY4eNXR
Username: TDWI | Tweet: #ATL #DataPros: Our #analyst, @prussom is headed your way to speak @ the #FDSRoadTour on Wed, 5/8! Register to attend for free, learn about Modern #DataManagement in the Age of #Cloud & #DataScience: Trends, Challenges & Opportunities. https://bit.ly/2WlYOJb #Atlanta #freeevent
Doesn’t look that good but we got what we wanted. TWEETS!
Saving results into Pandas
Sadly there’s no direct connection between Twint and Spark, but we can do it with Pandas and then pass the result to Optimus.
I created two simple functions, which you can see in the actual project, that help you with Pandas and the weird Twint API for this part. So when we run this:
available_columns()
You’ll see:
Index(['conversation_id', 'created_at', 'date', 'day', 'hashtags', 'hour','id', 'link', 'location', 'name', 'near', 'nlikes', 'nreplies','nretweets', 'place', 'profile_image_url', 'quote_url', 'retweet','search', 'timezone', 'tweet', 'user_id', 'user_id_str', 'username'],dtype='object')
These are the columns we have from the query we just did. There are a lot of different things to do with this data, but for this article I’ll only use some of these columns. So to transform the result from Twint to Pandas we run:
df_pd = twint_to_pandas(["date", "username", "tweet", "hashtags", "nlikes"])
and you’ll see this Pandas DF:
Much better isn’t it?
Sentiment Analysis (the simple way)
We will run a sentiment analysis on some tweets, using Optimus and TextBlob, a library for NLP. The first thing we need to do is clean these tweets, and Optimus is a great fit for that.
To save the data as an Optimus (Spark) DF we need to run:
df = op.create.data_frame(pdf=df_pd)
We’ll just remove accents and special characters with Optimus (in a real-world scenario you need to do much more than this, like removing links, images, and stopwords). To do that:
clean_tweets = df.cols.remove_accents("tweet") \
    .cols.remove_special_chars("tweet")
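As a rough idea of what that extra cleaning could look like, here is a small standard-library sketch that drops links, pic.twitter.com image references and a few stopwords from a single tweet. The function name, the regular expression and the tiny stopword list are all my own assumptions for illustration; a real project would use a proper stopword list. Note that this kind of step has to run before remove_special_chars, because once the punctuation is gone a URL is no longer recognizable as one.

```python
import re

# Matches http(s) links and pic.twitter.com image references.
URL_RE = re.compile(r"(?:https?://|pic\.twitter\.com/)\S+")

# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "with"}


def strip_links_and_stopwords(text):
    """Drop URLs and picture links first, then drop stopwords."""
    no_links = URL_RE.sub(" ", text)
    words = [w for w in no_links.split() if w.lower() not in STOPWORDS]
    return " ".join(words)


sample = (
    "Top 7 R Packages for Data Science and AI "
    "https://example.com/post pic.twitter.com/abc123 #DataScience"
)
print(strip_links_and_stopwords(sample))
```

One possible place to apply it is on the Pandas side, before creating the Optimus DF, for example with `df_pd["tweet"].apply(strip_links_and_stopwords)`.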
Then we need to collect these tweets from Spark into a Python list. To do that:
tweets = clean_tweets.select("tweet").rdd.flatMap(lambda x: x).collect()
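If the flatMap part of that line looks cryptic: each element of the RDD is a Spark Row, which behaves like a tuple, so `flatMap(lambda x: x)` simply unpacks every one-element row into its value. The following plain-Python analogy uses ordinary tuples as stand-ins for Rows (an assumption made so it runs without Spark) and shows the same flattening.

```python
from itertools import chain

# Stand-ins for Spark Row objects holding a single "tweet" column.
rows = [("first tweet",), ("second tweet",), ("third tweet",)]

# Equivalent of flatMap(lambda x: x): yield every element of every row in order.
flat_tweets = list(chain.from_iterable(rows))
print(flat_tweets)
```

The result is a flat list of strings instead of a list of one-element rows, which is what the sentiment loop needs.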
Then, to analyze the sentiment of these tweets, we will use TextBlob’s sentiment function:
from textblob import TextBlob
from IPython.display import Markdown, display

# Pretty printing the result
def printmd(string, color=None):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))

for tweet in tweets:
    print(tweet)
    analysis = TextBlob(tweet)
    print(analysis.sentiment)
    # sentiment[0] is the polarity score
    if analysis.sentiment[0] > 0:
        printmd('Positive', color="green")
    elif analysis.sentiment[0] < 0:
        printmd('Negative', color="red")
    else:
        printmd("Neutral", color="grey")
    print("")
That will give us:
IAM Platform Curated Retweet Via httpstwittercomarmaninspace ArtificialIntelligence AI What About The User Experience httpswwwforbescomsitestomtaulli20190427artificialintelligenceaiwhatabouttheuserexperience AI DataScience MachineLearning BigData DeepLearning Robots IoT ML DL IAMPlatform TopInfluence ArtificialIntelligence
Sentiment(polarity=0.0, subjectivity=0.0)
Neutral
Seattle Data Science Career Advice Landing a Job in The Emerald City Tips from Metis Seattle Career Advisor Marybeth Redmond – httpsbitly2IYjzaj pictwittercom98hMYZVxsu
Sentiment(polarity=0.0, subjectivity=0.0)
Neutral
This webinarworkshop is designed for business leaders data science managers and decision makers who want to build effective AI and data science capabilities for their organization Register here httpsbitly2GDQeQT pictwittercomxENQ0Dtv1X
Sentiment(polarity=0.6, subjectivity=0.8)
Positive
Contoh yang menarik dari sport science kali ini dari sisi statistik dan pemetaan lapangan Dengan makin gencarnya scientific method masuk di sport maka pengolahan data seperti ini akan semakin menjadi hal biasa httpslnkdinfQHqgjh
Sentiment(polarity=0.0, subjectivity=0.0)
Neutral
Complete handson machine learning tutorial with data science Tensorflow artificial intelligence and neural networks Machine Learning Data Science and Deep Learning with Python httpsmedia4yousocialcareerdevelopmenthtmlmachinelearning python machine learning online data science udemy elearning pictwittercomqgGVzRUFAM
Sentiment(polarity=-0.16666666666666666, subjectivity=0.6)
Negative
We share criminal data bases have science and medical collaoarations Freedom of movement means we can live and work in EU countries with no hassle at all much easier if youre from a poorer background We have Erasmus loads more good things
Sentiment(polarity=0.18939393939393936, subjectivity=0.39166666666666666)
Positive
Value of Manufacturers Shipments for Durable Goods BigData DataScience housing rstats ggplot pictwittercomXy0UIQtNHy
Sentiment(polarity=0.0, subjectivity=0.0)
Neutral
Top DataScience and MachineLearning Methods Used in 2018 2019 AI MoRebaie TMounaged AINow6 JulezNorton httpswwwkdnuggetscom201904topdatasciencemachinelearningmethods20182019html
Sentiment(polarity=0.5, subjectivity=0.5)
Positive
Come check out the Santa Monica Data Science Artificial Intelligence meetup to learn about In PersonComplete Handson Machine Learning Tutorial with Data Science httpbitly2IRh0GU
Sentiment(polarity=-0.6, subjectivity=1.0)
Negative
Great talks about the future of multimodality clinical translation and data science Very inspiring 1stPETMRIsymposium unitue PETMRI molecularimaging AI pictwittercomO542P9PKXF
Sentiment(polarity=0.4833333333333334, subjectivity=0.625)
Positive
Did engineering now into data science last 5 years and doing MSC in data science this year
Sentiment(polarity=0.0, subjectivity=0.06666666666666667)
Neutral
Program Officer – Data Science httpbitly2PV3ROF
Sentiment(polarity=0.0, subjectivity=0.0)
Neutral
And so on.
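The labels in that output depend only on the sign of the polarity score. As a small sketch, the rule can be pulled out into a standalone function; the name `label_polarity` is hypothetical, and the sample scores are taken from the output above.

```python
def label_polarity(polarity):
    """Map a polarity score to the label used in the output above."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Polarity values copied from three of the tweets shown above.
for score in (0.6, -0.16666666666666666, 0.0):
    print(score, label_polarity(score))
```

Keeping the rule in one small function makes it easy to change the thresholds later, for example to treat scores very close to zero as neutral too.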
Well that was extremely easy, but it won’t scale, because in the end we are collecting the data from Spark so the driver’s RAM is the limit. Let’s do it a little better.