Ten more random useful things in R you may not know about
By Keith McNulty, McKinsey & Company
I was surprised by the positive reaction to my article a couple of months ago, which itemized ten random things in R that people might not know about.
I had a feeling that R has developed as a language to such a degree that many of us are using it now in completely different ways. This means that there are likely to be numerous tricks, packages, functions, etc. that each of us uses, but that others are completely unaware of and would find useful if they knew about them. As Mike Kearney also pointed out, none of my list of ten had anything to do with stats, which just shows how far R has come in recent years.
To be honest, I struggled to keep it to ten last time, so here are ten more things about R that help make my work easier and which you might find useful. Do drop a note here or on Twitter if any of these are helpful to your current work, or if you have further suggestions for things which others should know about.
1. dbplyr
dbplyr is exactly what its name implies. It allows you to use dplyr with databases. If you work with databases and you’ve never heard of dbplyr then you are likely still using SQL strings in your code, which forces you to think SQL when you actually want to think tidy, and can be a real pain when you want to abstract your code to generate functions and the like.
dbplyr allows you to create your SQL query using dplyr. It does this by establishing a database table that can be manipulated using dplyr functions, translating those functions into SQL. For example, if you have a database connection called con, and you want to manipulate a table in CAT_SCHEMA, you can set this table up as a remote table object called cat_table using dplyr::tbl().
Then you can perform the usual manipulations such as summarise etc. on cat_table, and all of these will be translated into a SQL query in the background. What’s incredibly useful is that the data is not physically downloaded into your R session until you use the dplyr::collect() function to finally grab it. This means you can get SQL to do all the work and collect your manipulated data at the end, rather than having to pull the entire database at the beginning.
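As a minimal sketch of this workflow, the example below uses an in-memory SQLite database in place of a real remote connection. The table name cats and its columns are made up for illustration, and the RSQLite package is assumed to be installed.

```r
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database stands in for a real connection `con`
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table, created only so the example is self-contained
DBI::dbWriteTable(
  con, "cats",
  data.frame(breed  = c("Siamese", "Siamese", "Persian"),
             weight = c(4.1, 3.8, 5.2))
)

# A lazy reference to the database table; no rows are pulled yet
cat_table <- tbl(con, "cats")

# dplyr verbs are translated to SQL rather than run in R
query <- cat_table %>%
  group_by(breed) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))

# Inspect the SQL that dbplyr generated
show_query(query)

# Only now is the (already summarised) data brought into the R session
result <- collect(query)

DBI::dbDisconnect(con)
```

With a schema-qualified table on a real database, dbplyr::in_schema() can be passed to tbl() instead of a bare table name.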
2. rvest and xml2
People say Python is much better for web scraping. That may be true. But for those of us who like working in the tidyverse, the rvest and xml2 packages can make straightforward web scraping pretty easy by working with magrittr and allowing us to pipe commands. Given that HTML and XML code on webpages is usually heavily nested, I think it’s pretty intuitive to structure scraping code using pipes (%>%).
By initially reading the HTML code of the page of interest, these packages break the nested HTML and XML nodes into lists that you can progressively search and mine for specific nodes or attributes of interest. Using this in combination with Chrome’s inspect capability will allow you to quickly extract the key information you need from the webpage.
As a quick example, I recently wrote a function to scrape the basic Billboard music chart at any point in history as a dataframe from this fairly snazzy page using code as simple as this:
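A minimal sketch of that piped style is shown here, run against a small inline HTML fragment rather than the live Billboard page. The markup, class names and chart entries are invented for illustration; a real page would need selectors found with the browser inspector.

```r
library(rvest)

# Hypothetical HTML standing in for a downloaded chart page
html <- '<html><body>
  <ul class="chart">
    <li class="entry"><span class="rank">1</span>
      <span class="title">Song A</span><span class="artist">Artist X</span></li>
    <li class="entry"><span class="rank">2</span>
      <span class="title">Song B</span><span class="artist">Artist Y</span></li>
  </ul>
</body></html>'

# Parse the HTML into a nested document of nodes
page <- read_html(html)

# Narrow down to the repeating node that holds one chart entry
entries <- page %>% html_nodes("li.entry")

# Pull one field per entry and assemble a dataframe
chart <- data.frame(
  rank   = entries %>% html_node(".rank") %>% html_text(trim = TRUE) %>% as.integer(),
  title  = entries %>% html_node(".title") %>% html_text(trim = TRUE),
  artist = entries %>% html_node(".artist") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)
```

For a live page, read_html() would be given the URL instead of a string, and the rest of the pipeline stays the same.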
3. k-means on long data
k-means is an increasingly popular statistical method to cluster observations in data, often to simplify a large number of datapoints into a smaller number of clusters or archetypes. The kml package now allows k-means clustering to take place on longitudinal data, where the ‘datapoints’ are actually data series.
This is super useful where the datapoints you are studying are actually readings over time. This could be clinical observation of weight gain or loss in hospital patients, or compensation trajectories of employees.
kml works by first transforming data into an object of the class ClusterLongData using the cld function. Then it partitions the data using a ‘hill climbing’ algorithm, testing several values of k 20 times each. Finally, the choice() function allows you to view the results of the algorithm for each k graphically and decide what you believe to be an optimal clustering.
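The steps above might look roughly like the sketch below. It assumes simulated weight trajectories for 30 hypothetical patients (one row per patient, one column per weekly reading), and the object names are illustrative only.

```r
library(kml)

set.seed(42)
weeks <- 0:5

# Simulated data (an assumption for illustration): 15 stable and 15 rising trajectories
stable  <- matrix(rnorm(15 * 6, mean = 70, sd = 1), nrow = 15)
rising  <- t(replicate(15, 70 + 2 * weeks + rnorm(6, sd = 1)))
weights <- rbind(stable, rising)

# Step 1: wrap the trajectories in a ClusterLongData object
weight_cld <- cld(
  traj  = weights,
  idAll = paste0("patient_", 1:30),
  time  = weeks
)

# Step 2: run k-means for k = 2 to 4, with 20 random restarts for each k;
# kml() stores its results inside weight_cld rather than returning them
kml(weight_cld, nbClusters = 2:4, nbRedrawing = 20, toPlot = "none")

# Step 3: extract the cluster assignment for a chosen k
clusters <- getClusters(weight_cld, 2)

# In an interactive session, choice(weight_cld) opens the graphical
# browser for comparing the partitions found for each k
```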
4. The connections window in RStudio
The connections window in the latest version of RStudio allows you to browse any remote databases without having to move into a separate environment like SQL developer. This convenience now offers the opportunity to fulfil dev projects entirely within the RStudio IDE.
By setting up your connection to a remote database in the connections window, you can browse inside nested schemas, tables, data types, and even view a table directly to see an extract of what the data looks like.
More on the connections window here.
5. tidyr::complete()
The default behavior in R dataframes is that if no data exists for a particular observation, then the row for that observation does not appear in the dataframe. This can cause problems when you need to use this dataframe as an input for something which expects to see values for all possible observations.
Typically this problem occurs when you are sending the data into some graphing function that is expecting to see zero values when there are no observations, and can’t understand that a missing row means zero values in that row. This can also be an issue when you are making future projections and the starting point has missing rows.
The complete() function within tidyr allows you to fill in the gaps for all observations that had no data. It allows you to define the observations that you want to complete and then declare what value to use to plug the gaps. For example, if you were taking counts of male and female dogs of different breeds, and you had some combinations for which there were no dogs in the sample, you could use the following to deal with it:
This will expand your dataframe to ensure that all possible combinations of SEX and BREED are included, and it will fill in missing values of COUNT with zeros.
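A small runnable sketch of this, assuming a hypothetical dog_counts dataframe with SEX, BREED and COUNT columns in which one combination is missing:

```r
library(tidyr)

# Hypothetical sample: there are no female poodles, so that row is absent
dog_counts <- data.frame(
  SEX   = c("Male", "Female", "Male"),
  BREED = c("Beagle", "Beagle", "Poodle"),
  COUNT = c(3, 2, 5),
  stringsAsFactors = FALSE
)

# Expand to every SEX x BREED combination and plug missing COUNT values with 0
completed <- complete(dog_counts, SEX, BREED, fill = list(COUNT = 0))
```

The completed dataframe has four rows, including a Female/Poodle row with a COUNT of 0.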
6. gganimate
Animated graphics are all the rage at the moment, and the gganimate package allows those who use ggplot2 (most R users I would say) to very simply extend their code to create animated graphics.
gganimate works by taking data that exists over a series of ‘transition states’, usually years or some other sort of time series data. You can plot the data within each transition state as if it were a simple static ggplot2 chart, add a transition such as transition_states() to move between the states, and then use the ease_aes() function to control how values ease from one state to the next. There are numerous options for how the transition occurs, and the animate() function allows the graphic to be rendered in a variety of forms such as an animated gif or an mpeg.
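As a rough sketch of how little needs to be added to a static chart, the example below animates a bar chart over three years. The scores dataframe is invented for illustration and has nothing to do with real contest results.

```r
library(ggplot2)
library(gganimate)

# Made-up data: points per country for each year (the transition states)
scores <- data.frame(
  year    = rep(2016:2018, each = 3),
  country = rep(c("A", "B", "C"), times = 3),
  points  = c(10, 20, 15, 25, 30, 20, 40, 35, 45)
)

# An ordinary static ggplot2 chart, plus two gganimate layers
anim <- ggplot(scores, aes(x = country, y = points, fill = country)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Year: {closest_state}") +
  # define the states to move between and how long to spend on each
  transition_states(year, transition_length = 2, state_length = 1) +
  # control how values are interpolated during each transition
  ease_aes("cubic-in-out")

# Rendering is left commented out because it needs a renderer such as gifski:
# animate(anim, nframes = 60, fps = 10, renderer = gifski_renderer("scores.gif"))
```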
As an example, here’s a gif I created that shows all-time points won by entrants in the Eurovision Song Contest from 1957 to 2018:
7. networkD3
There are a number of packages that let R users build visualizations in D3, such as R2D3, which is great not least because it allows us to admire one of the best hex stickers ever (see here).
My favourite D3 package for R is networkD3. It has been around for a little while and is fantastic for plotting graph or network data in a responsive or aesthetically pleasing way. In particular, it can plot force-directed networks using forceNetwork(), Sankey diagrams using sankeyNetwork() and chord diagrams using chordNetwork(). Here’s an example of a simple Sankey network I created showing voting flows by region in the Brexit referendum.
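A minimal sketch of a Sankey diagram built with sankeyNetwork() follows. The regions and flow values are placeholders invented for illustration, not actual referendum figures.

```r
library(networkD3)

# Nodes are referred to by zero-based row index in the links table
nodes <- data.frame(
  name = c("Region A", "Region B", "Leave", "Remain"),
  stringsAsFactors = FALSE
)

# Made-up flows from each region (source) to each outcome (target)
links <- data.frame(
  source = c(0, 0, 1, 1),
  target = c(2, 3, 2, 3),
  value  = c(55, 45, 40, 60)
)

sankey <- sankeyNetwork(
  Links  = links,
  Nodes  = nodes,
  Source = "source",
  Target = "target",
  Value  = "value",
  NodeID = "name",
  fontSize  = 12,
  nodeWidth = 30
)

# Printing `sankey` in RStudio, RMarkdown or Shiny renders the interactive diagram
```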
8. Datatables in RMarkdown or Shiny using DT
For example, a simple command such as:
Can produce something as nice as this:
More on DT here, including how to set various options to customize the layout and add data download, copy and print buttons.
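As an illustration of how short such a command can be, this sketch renders the first rows of the built-in iris dataset. The caption text and the choice of dataset are assumptions made for the example.

```r
library(DT)

# Build an interactive table widget with sorting, search and paging
iris_table <- datatable(
  head(iris),
  caption  = "The first few rows of the iris dataset",
  rownames = FALSE
)

# Printing `iris_table` in an RMarkdown chunk displays the table;
# in Shiny it would be returned from DT::renderDT()
```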
9. Pimp your RMarkdown with prettydoc
prettydoc is a package by Yixuan Qiu which offers a simple set of themes to create a different, prettier look and feel to your RMarkdown documents. This is super helpful when you just want to jazz up your documents a little but don’t have time to get into the styling of them yourself.
It’s really easy to use. Simple edits to the YAML header of your document can invoke a specific style theme throughout the document, with numerous themes available. For example, this will invoke a lovely clean blue coloring and style across titles, tables, embedded code and graphics:
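The sketch below writes a tiny RMarkdown file whose YAML header selects a prettydoc theme, then reads the header back to show what was set. The theme and highlight values are one possible choice among the themes prettydoc ships with and are assumed here; the file contents are purely illustrative.

```r
# Lines of a minimal RMarkdown document; the YAML header sits between the --- markers
rmd_lines <- c(
  "---",
  "title: \"My analysis\"",
  "output:",
  "  prettydoc::html_pretty:",
  "    theme: architect",
  "    highlight: github",
  "---",
  "",
  "Narrative text goes here."
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Read the header back to confirm which output format and theme are requested
front_matter <- rmarkdown::yaml_front_matter(rmd_path)

# Knitting needs pandoc and the prettydoc package, so it is left commented out:
# rmarkdown::render(rmd_path)
```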
10. Optionally hide your code in RMarkdown with code_folding
RMarkdown is a great way to record your work, allowing you to write a narrative and capture your code all in one place. But sometimes your code can be overwhelming and not particularly pleasant for non-coders who are trying to read just the narrative of your work and are not interested in the intricacies of how you conducted the analysis.
Previously the only options we had were to either set echo = TRUE or echo = FALSE in our knitr options to either show our code in the document or not. But now we can set an option in the YAML header that gives us the best of both worlds. Setting code_folding: hide in the YAML header will hide the code chunks by default, but provide little click-down boxes in the document so that the reader can view all the code, or particular chunks, as and when they want to, like this:
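A small sketch of where this setting lives is given below. It assumes the standard html_document output format, and the document title and chunk contents are made up for the example.

```r
# Three backticks, built programmatically to delimit the embedded code chunk
fence <- strrep("`", 3)

# A minimal RMarkdown document with code folding switched on in the YAML header
rmd_lines <- c(
  "---",
  "title: \"Folded code demo\"",
  "output:",
  "  html_document:",
  "    code_folding: hide",
  "---",
  "",
  "The narrative stays visible to every reader.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)

folded_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, folded_path)

# Read the header back to confirm the option that will be applied when knitting
folded_header <- rmarkdown::yaml_front_matter(folded_path)

# Knitting needs pandoc, so it is left commented out:
# rmarkdown::render(folded_path)
```

Using code_folding: show instead keeps the chunks expanded by default while still letting readers collapse them.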
So that wraps up my next ten random R tips. I hope some of these were helpful, and please feel free to add any of your own tips to the comments for other users to read.
Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter.
Bio: Keith McNulty is a Data Scientist at McKinsey & Company.
Original. Reposted with permission.