Text Mining with R: The Free eBook
This freely-available book will show you how to perform text analytics in R, using packages from the tidyverse.
I readily admit that I'm biased toward Python. This isn't intentional — such is the case with many biases — but coming from a computer science background and having been programming since a very young age, I have naturally tended towards general purpose programming languages (Java, C, C++, Python, etc.). This is the major reason that Python books and resources are at the forefront of my radar, recommendations, and reviews.
Obviously, however, not all data scientists are in this same position, given that there are innumerable paths to data science. With that in mind, and since R is a powerful and popular programming language for a large swath of data scientists, today let's take a look at a book which uses R as a tool to implement solutions to data science problems.
R is designed specifically for statistical computing, in contrast to general purpose languages. The trade-off is that its relative lack of generality allows better optimization for specialized scenarios. R's optimization for statistical computing is a big reason why it enjoys such high levels of adoption in data science and analytics.
Text analytics, like all applications and sub-genres of natural language processing, keeps growing in importance for data science, data scientists, and a variety of industries. R, together with the tidyverse (its opinionated collection of packages designed for data science), is an established environment for statistical computing that is fully capable of performing text analytics, so today we will look at Text Mining with R: A Tidy Approach.
Written by Julia Silge and David Robinson, this book endeavors to cover the following major topics, taken from the outline in the book's preface:
- We start by introducing the tidy text format, and some of the ways dplyr, tidyr, and tidytext allow informative analyses of this structure.
- Text won’t be tidy at all stages of an analysis, and it is important to be able to convert back and forth between tidy and non-tidy formats.
- We conclude with several case studies that bring together multiple tidy text mining approaches we’ve learned.
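To make the "tidy text format" a little more concrete, here is a minimal sketch of the one-token-per-row idea. The book does this in R with the tidytext package's `unnest_tokens()`; what follows is only a Python stand-in using the standard library, and the sample documents, the regular expression, and the `to_tidy_rows` helper are all my own assumptions for illustration rather than anything taken from the book.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, made up purely for illustration.
documents = {
    "doc1": "The quick brown fox jumps over the lazy dog.",
    "doc2": "The dog sleeps; the fox runs.",
}


def to_tidy_rows(docs):
    """Return a list of (document, word) rows: one lowercased token per row."""
    rows = []
    for doc_id, text in docs.items():
        # Crude tokenizer (an assumption): runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            rows.append((doc_id, word))
    return rows


tidy_rows = to_tidy_rows(documents)

# With one token per row, per-document word counts are a simple tally.
word_counts = Counter(tidy_rows)

for (doc_id, word), n in word_counts.most_common(3):
    print(doc_id, word, n)
```

Once the text is in this shape, counting, filtering, and joining against other tables (a sentiment lexicon, for example) become ordinary table operations, which is roughly the appeal of the tidy approach.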
For a more fleshed-out list of the topics treated within, the book's table of contents is as follows:
- The tidy text format
- Sentiment analysis with tidy data
- Analyzing word and document frequency: tf-idf
- Relationships between words: n-grams and correlations
- Converting to and from non-tidy formats
- Topic modeling
- Case study: comparing Twitter archives
- Case study: mining NASA metadata
- Case study: analyzing usenet text
Text Mining with R: A Tidy Approach is code-heavy and seems to explain concepts well. The focus is on practical implementation, which should be no surprise given the book's title, and to an R novice it seems to do a very good job. I have not followed along with the entire book, but I did read the first two chapters and feel that I got out of them what was intended.
The book is also very transparent as to what it is not:
This book serves as an introduction to the tidy text mining framework along with a collection of examples, but it is far from a complete exploration of natural language processing. The CRAN Task View on Natural Language Processing provides details on other ways to use R for computational linguistics. There are several areas that you may want to explore in more detail according to your needs.
- Clustering, classification, and prediction
- Word embedding
- More complex tokenization
- Languages other than English
All in all, this seems to strike a good balance. If you aren't familiar with NLP to any degree, regardless of your familiarity with the tidyverse, jumping into the deep end with complex tokenization and using word embeddings to solve problems probably isn't a good idea. The starting point really should be what this book lays out, and what it lays out well.
It's at this point I should tell you that this is not actually an eBook; Text Mining with R is an online version of the print book. You can read the book online, and you can also buy physical copies from Amazon.
Whether you are interested in applying text mining to your projects and currently reside in the world of R, or you are looking to venture into using R and need some direction in doing so, check out Text Mining with R: A Tidy Approach. I'm certain you will find it beneficial.