Tips for Getting Started with Text Mining in R and Python
This article opens up the world of text mining in a simple and intuitive way and provides great tips to get started with text mining.
By Chaitanya Sagar, Perceptive Analytics.
It All Starts With The Text
There is so much information in the text posts made by you, me and everyone else about today’s trending topics. In our respective firms, big or small, each of us collects data related to our business and stores it for analysis in various projects. At the same time, we all need this ‘unstructured data’ to understand more about our clients, our customers and the state of our company in the world today. However, working with this data is not easy. The data is not structured, not every piece carries all the information, and each part is unique. That is the nature of textual data. It has to be processed first and converted into a form that is suitable for analysis. Once processed, it resembles the databases we create ourselves; the difference is that the raw text cannot be used directly and the amount of data is very large.
Tip #1: Think First
Text mining can look like a mammoth task, but it becomes manageable if you work on it with a plan in mind. Think about what you need to do with the text before going all out on it. What is your objective behind text mining? What sources of data do you want to use? How much data do you need for it to be sufficient? How do you plan to present your results from the data? It is all about getting curious about your problem and breaking it into small fragments. Thinking through the problem also opens your mind to the various situations you may encounter and ways to tackle them. You can then chart out a workflow and start pursuing the task.
Tip #2: R or Python... Or Something Else?
There is no gold-standard procedure for text mining. You have to choose the method that is most convenient for your problem. This is where factors such as efficiency, effectiveness and the type of problem come into play and help you decide on the best candidate. Once you have chosen your path, you need to build your knowledge and skills in that language. I find text mining techniques more intuitive in Python than in R, but R has some handy functions for tasks such as word counting and is richer in terms of packages available for text mining.
Tip #3: Start Early and Collect Your Data
The usual process of text mining involves the following steps:
- Collect data, either from social media such as Twitter or from other websites. Write code that can adjust to the specific type of text you collect, and store the text
- Convert your data into readable text
- Remove special characters from the text (such as hashtags). You can add a hashtag count feature if that is required
- Remove numbers from the text data (unless the problem requires numbers)
- Decide whether to keep all the data or remove some of it, such as all non-English text
- Convert all the text to uppercase or lowercase only to ease analysis
- Remove stop words, that is, words that have no use in your analysis. These include articles, conjunctions, etc.
- Apply word stemming to group similar words; for example, ‘keep’ and ‘keeping’ are the same word in different forms
- Analyze the processed, stemmed words and visualize the results
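The cleaning steps in the list above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, naive_stem is a crude suffix stripper made up for this example (a real project would more likely use a stemmer and stop-word list from a library such as nltk), and the language-filtering step is left out.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# real projects would use a much fuller list.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "is", "are",
              "to", "of", "in", "on", "for", "it", "this"}


def naive_stem(word):
    """Crude suffix stripping for illustration only; not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return (stemmed_tokens, hashtag_count) for one raw text string."""
    hashtag_count = len(re.findall(r"#\w+", text))  # keep the count as a feature
    text = text.lower()                              # use one case only
    text = re.sub(r"https?://\S+", " ", text)        # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)            # drop digits and special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens], hashtag_count


tokens, n_hashtags = preprocess("Keeping 3 #cats is the BEST! http://x.co")
```

With this sketch, ‘keep’ and ‘keeping’ both reduce to the same token, which is the grouping the stemming step is meant to achieve. The order of the steps matters: hashtags are counted before special characters are stripped, otherwise the count would always be zero.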
The steps are short and simple, but they all depend on the first step being executed well. You need to collect your data before text mining can be performed on it. There are many ways to collect data. One of the most popular sources is Twitter, which exposes APIs so that tweets can be mined using both R and Python. Besides Twitter, one can capture data from any website today, including e-commerce, movie and song websites. Some websites also host preformatted repositories of text data, such as Project Gutenberg and other text corpora. Google Trends and Yahoo also offer some analysis online.
Tip #4: Find and Use The Best Way to Convert Text to Data
Depending on your tools and your project objective, you may use a different approach to convert your collected text to data. If you are using R, packages such as twitteR, tm and stringr are what you will likely use for most of the preprocessing. The nltk library and the Tweepy package are their counterparts in Python. Whichever language and package you use, make sure that you have enough resources and memory to handle the data. Text mining can be cumbersome simply because of the irrelevant text left in your data even after removing stop words. Using a good method to prepare the data will give you a lot of useful information when you apply modelling techniques to it.
Tip #5: Explore and Play Around
You need to know your data before preprocessing it. Without knowing what your data looks like, you might carelessly remove text that would have been useful in your analysis. There are many standard methods and dictionaries for removing stop words and assigning importance to words, but they may or may not apply to your data. For example, data about the government may include a lot of words such as ‘rule’, ‘govern’ and ‘politics’, which you may deem unnecessary and want to remove. Reviews may include lots of ‘hi’ at the beginning, which adds nothing to a review dataset. It is always a good idea to look at the source of the data and go through some of the text to check that the process you defined is transforming it correctly into useful information. Another way to explore text data is to create a document term matrix. A document term matrix is an m × n matrix in which the number of rows (m) is the total number of data points and the number of columns (n) is the total number of unique words in the entire dataset. Each cell thus holds the count of a particular word in a particular data point. This is a very large matrix and is later collapsed into term frequencies: summing each column gives the total occurrences of each word in the dataset, and that is exactly what the term-frequency matrix stores. Other uses of the document term matrix include finding correlations between words, drawing a word cloud from the term frequencies, or predicting patterns using modelling techniques. This exploration will give you more confidence about the best way to move forward with textual data analysis.
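To make the document term matrix concrete, here is a small sketch in plain Python. The function names and the three toy documents are assumptions made up for this example. It stores the matrix as a dense list of lists, which is fine for a demonstration but would be replaced by a sparse representation on real data, since most cells are zero.

```python
from collections import Counter


def document_term_matrix(docs):
    """docs is a list of token lists. Returns (vocab, matrix), one row per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            matrix[i][index[term]] = count
    return vocab, matrix


def term_frequencies(vocab, matrix):
    """Collapse the matrix into total counts per word by summing each column."""
    totals = [sum(column) for column in zip(*matrix)]
    return dict(zip(vocab, totals))


# Toy, already-tokenized documents (made up for illustration).
docs = [
    ["good", "phone", "good", "battery"],
    ["bad", "battery"],
    ["good", "screen"],
]
vocab, dtm = document_term_matrix(docs)
tf = term_frequencies(vocab, dtm)
```

Here each row of dtm corresponds to one document and each column to one word in vocab, and tf is the collapsed term-frequency view that a word cloud would typically be drawn from.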
Tip #6: Dive Deep and Get Your Hands Dirty
The primary objective of every machine learning and data science project is to find patterns in the data that are otherwise hard to find. You need to look for those interesting patterns; you are not a true data scientist if you’re scared of this step. It can be as simple as fitting a simple classifier to the data points and checking its performance. This sets a benchmark while giving you an idea of the predictive ability of the data. At times, the data may be biased or have poor predictive power, and data quality checks can help identify this. For example, if I am collecting Twitter data on the basis of hashtags, I can divide my collected data into train and test datasets, keeping the hashtags as the dependent variable. If my prediction performance is not up to the mark, I need to go back a few steps, find the cause of the low performance, and then check how I am collecting or cleaning my data, as the case may be. Other ways of finding patterns involve associations. For example, some data points may be related to each other, while others may show a similar or opposite pattern. If tweets are being used for text mining, there can be duplicate tweets because of retweeting, or debates for or against a remark. Working with the data also exposes problems such as dealing with sarcasm or comments that convey mixed expressions. Without brushing through the data, it will be difficult to know how much of it is affected by these problems and whether you should drop such data or use some technique to handle the situation.
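As one possible version of the benchmark described above, the sketch below fits a small multinomial Naive Bayes classifier with add-one smoothing in plain Python and measures accuracy on a held-out test set. The choice of Naive Bayes, the function names and the toy data (with the hashtag as the label) are all assumptions for illustration; any simple classifier from a library would serve the same purpose.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {tok for tokens in docs for tok in tokens}
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log-probability for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Toy data: the hashtag used to collect each tweet acts as its label.
train_docs = [["pizza", "tasty"], ["goal", "match"], ["pasta", "tasty"], ["match", "win"]]
train_labels = ["food", "sports", "food", "sports"]
test_docs = [["tasty", "pasta"], ["win", "goal"]]
test_labels = ["food", "sports"]

model = train_naive_bayes(train_docs, train_labels)
baseline_accuracy = accuracy(model, test_docs, test_labels)
```

On real data the accuracy will be far less clean than on this toy set. A low score is a signal to revisit how the data was collected or cleaned before trying a more complex model.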
Tip #7: Rework and Repeat
The problem you are trying to solve may or may not be the first text mining problem in your company, but it is certainly not the first text mining problem in the world. Several data scientists out there have worked on the same or a similar problem, and knowing what methods they followed and what they did differently will help you take your problem solving to the next level. Though not as frequent as in other domains, there are several analyses and projects being done in text mining, including finding the trending topics, sentiment analysis on the trending topics, identifying remarks about your firm or product, and identifying grievances and appreciations. More than one problem can be solved with the same data. Complex problems that can be explored also include NLP and topic modelling. I read about a fairly recent project in which some students predicted the next topic a group of people would discuss based on the current conversation. Many such new projects can be thought of and pursued in the area of text mining, but since it is a new and hot field to work in, always refer to other similar data and resources to complement your analysis and come up with strong insights.
Tip #8: Presenting Text Visually
As mentioned earlier, many problems can be pursued using text mining, and more than one problem can be solved from the same data. With so much to present, it is good practice to find ways of presenting the results that people will find attractive. This is why most text mining results are visualized in the form of word clouds, sentiment studies and figures. There are packages and libraries for each such task, including wordcloud, ggplot2, igraph, text2vec, networkD3 and plotly in R and NetworkX, matplotlib and plotly in Python. You can also use sophisticated dedicated visualization tools such as Tableau or Power BI, which can help you visualize your data in many more ways.
Conclusion: A Roadmap
Visualizing results is not the final step in text mining projects. Since text is captured from online sources, it is constantly changing, and so is the data that is captured. Changing data brings changing insights, so once the project is completed and accepted, it should be continuously updated with new data and new insights. These insights can be further enriched with the rate of change. Over time, the change itself can be captured and used as a metric of progression. This becomes another, longitudinal problem to be solved. Whatever problem you pursue with text data, text mining is no easy feat. When you create a roadmap for collecting, cleaning and analyzing data, several obstacles may come your way. You may have to decide whether to work with single-word frequencies in the document term matrix or with groups of words (known as n-grams), whether to build your own visualization method to present your results, or how to manage memory. At the same time, new projects are coming up in the area of text mining. The best way to learn is to face the problem hands-on and learn from the experience of working on it. I hope this article motivates you to head into the world of text and start mining insightful nuggets of information.
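For readers unfamiliar with the n-grams mentioned above, the following minimal sketch shows the difference between counting single words and counting groups of adjacent words. The ngrams helper and the toy token list are assumptions made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy token list (made up for illustration).
tokens = ["not", "good", "at", "all", "not", "good"]
unigram_counts = Counter(ngrams(tokens, 1))  # single-word frequencies
bigram_counts = Counter(ngrams(tokens, 2))   # frequencies of adjacent word pairs
```

Single-word counts would treat ‘good’ as a positive signal here, while the bigram (‘not’, ‘good’) keeps the negation attached. The price is a much larger vocabulary, which ties back to the memory management concern.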
Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.