Ten more random useful things in R you may not know about

By Keith McNulty, McKinsey & Company

I was surprised by the positive reaction to my article a couple of months ago, which itemized ten random things in R that people might not know about.

I had a feeling that R has developed as a language to such a degree that many of us are using it now in completely different ways. This means that there are likely to be numerous tricks, packages, functions, etc that each of us use, but that others are completely unaware of, and would find useful if they knew about them. As Mike Kearney also pointed out, none of my list of ten had anything to do with stats, which just shows how far R has come in recent years.

To be honest, I struggled to keep it to ten last time, so here are ten more things about R that help make my work easier and which you might find useful. Do drop a note here or on Twitter if any of these are helpful to your current work, or if you have further suggestions for things which others should know about.

1. dbplyr

dbplyr is exactly what its name implies. It allows you to use dplyr with databases. If you work with databases and you’ve never heard of dbplyr then you are likely still using SQL strings in your code, which forces you to think SQL when you actually want to think tidy, and can be a real pain when you want to abstract your code to generate functions and the like.

dbplyr allows you to create your SQL query using dplyr. It does this by creating a reference to a database table that can be manipulated using dplyr functions, which it translates into SQL. For example, if you have a database connection called con, and you want to manipulate a table called CAT_TABLE within CAT_SCHEMA, you can set this table up as:

cat_table <- dplyr::tbl(
  con,
  dbplyr::in_schema("CAT_SCHEMA", "CAT_TABLE")
)

Then you can perform the usual manipulations such as filter, mutate, group_by, summarise, etc. on cat_table and all of these will be translated into a SQL query in the background. What’s incredibly useful is that the data is not physically downloaded into your R session until you use the dplyr::collect() function to finally grab it. This means you can get SQL to do all the work and collect your manipulated data at the end, rather than having to pull the entire table at the beginning.
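To make the lazy behaviour concrete, here is a minimal sketch. It assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in for a real remote connection. The table contents and column names (BREED, WEIGHT) are made up for illustration, and in_schema() is left out because this toy database has no schemas.

```r
library(dplyr)

# Assumption: an in-memory SQLite database stands in for a remote database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical data, written to the database purely for illustration
DBI::dbWriteTable(con, "CAT_TABLE", data.frame(
  BREED  = c("Siamese", "Siamese", "Persian", "Persian"),
  WEIGHT = c(4.1, 3.9, 5.2, 4.8)
))

cat_table <- dplyr::tbl(con, "CAT_TABLE")

# Build the query lazily: no data is pulled into R at this point
cat_summary <- cat_table %>%
  dplyr::filter(WEIGHT > 4) %>%
  dplyr::group_by(BREED) %>%
  dplyr::summarise(MEAN_WEIGHT = mean(WEIGHT, na.rm = TRUE), N = dplyr::n())

# Inspect the SQL that dbplyr generated from the dplyr verbs
dplyr::show_query(cat_summary)

# Only now is the query run and the (small) result downloaded
result <- dplyr::collect(cat_summary)

DBI::dbDisconnect(con)
```

Calling show_query() before collect() is a handy way to check that the generated SQL does what you expect before any data moves.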

For more on dbplyr you can check my previous article here and the tutorial here.

2. rvest and xml2

People say Python is much better for web scraping. That may be true. But for those of us who like working in the tidyverse, the rvest and xml2 packages can make straightforward web scraping pretty easy by working with magrittr and allowing us to pipe commands. Given that HTML and XML code on webpages is usually heavily nested, I think it’s pretty intuitive to structure scraping code using %>%.

By initially reading the HTML code of the page of interest, these packages break the nested HTML and XML nodes into lists that you can progressively search and mine for specific nodes or attributes of interest. Using this in combination with Chrome’s inspect capability will allow you to quickly extract the key information you need from the webpage.

As a quick example, I recently wrote a function to scrape the basic Billboard music chart at any point in history as a dataframe from this fairly snazzy page using code as simple as this:

library(magrittr)  # provides the %>% pipe used below

get_chart <- function(date = Sys.Date(), positions = c(1:10), type = "hot-100") {
  # get url from input and read html
  input <- paste0("https://www.billboard.com/charts/", type, "/", date)
  chart_page <- xml2::read_html(input)
  # scrape data
  chart <- chart_page %>%
    rvest::html_nodes('body') %>%
    xml2::xml_find_all("//div[contains(@class, 'chart-list-item  ')]")
  rank <- chart %>%
    xml2::xml_attr('data-rank')
  artist <- chart %>%
    xml2::xml_attr('data-artist')
  title <- chart %>%
    xml2::xml_attr('data-title')
  # create dataframe, remove nas and return result
  chart_df <- data.frame(rank, artist, title)
  chart_df <- chart_df %>%
    dplyr::filter(!is.na(rank), rank %in% positions)
  chart_df
}
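Since the function above depends on the live page keeping the same structure, here is a small offline sketch of the same pattern. The HTML fragment, artists and titles are invented for illustration; only the xml2 calls are the same as those used in the function.

```r
library(magrittr)

# Hypothetical HTML fragment with the kind of attributes the function reads
html <- '<html><body>
  <div class="chart-list-item" data-rank="1" data-artist="Artist A" data-title="Song A"></div>
  <div class="chart-list-item" data-rank="2" data-artist="Artist B" data-title="Song B"></div>
  <div class="other"></div>
</body></html>'

# read_html() accepts a literal string as well as a URL or file path
page <- xml2::read_html(html)

# Find every div whose class attribute contains 'chart-list-item'
items <- page %>%
  xml2::xml_find_all("//div[contains(@class, 'chart-list-item')]")

# Pull the attributes of interest out of the matched nodes
chart_df <- data.frame(
  rank   = as.integer(xml2::xml_attr(items, "data-rank")),
  artist = xml2::xml_attr(items, "data-artist"),
  title  = xml2::xml_attr(items, "data-title")
)
```

Working against a saved or inline snippet like this is a quick way to get the XPath right before pointing the code at a real page.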

More on this example here, more on rvest here and more on xml2 here.

3. k-means on long data

k-means is an increasingly popular statistical method to cluster observations in data, often to simplify a large number of datapoints into a smaller number of clusters or archetypes. The kml package now allows k-means clustering to take place on longitudinal data, where the ‘datapoints’ are actually data series.

This is super useful where the datapoints you are studying are actually readings over time. This could be clinical observation of weight gain or loss in hospital patients, or compensation trajectories of employees.

kml works by first transforming data into an object of the class ClusterLongData using the cld function. Then it partitions the data using a ‘hill climbing’ algorithm, testing several values of k, 20 times each. Finally, the choice() function allows you to view the results of the algorithm for each k graphically and decide what you believe to be an optimal clustering.
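A rough sketch of those steps, assuming the kml package is installed. The trajectories are simulated and purely hypothetical (one flat group and one rising group), and I use getClusters() to pull out the assignments non-interactively, since choice() opens an interactive graphical window.

```r
library(kml)

set.seed(42)
n_per_group <- 30
times <- 0:5

# Hypothetical data: each row is one subject, each column one time point
noise <- function() {
  matrix(rnorm(n_per_group * length(times), sd = 0.5), nrow = n_per_group)
}
flat   <- noise()
rising <- noise() + matrix(rep(2 * times, each = n_per_group), nrow = n_per_group)
traj   <- rbind(flat, rising)

# Step 1: wrap the trajectories in a ClusterLongData object
traj_cld <- cld(traj, idAll = paste0("id", seq_len(nrow(traj))), time = times)

# Step 2: run k-means for k = 2, 3, 4, with 5 random restarts per k
# (fewer than the 20 mentioned above, just to keep the sketch quick)
kml(traj_cld, nbClusters = 2:4, nbRedrawing = 5, toPlot = "none")

# Step 3: extract the cluster assignment for k = 2
clusters <- getClusters(traj_cld, nbCluster = 2)
table(clusters)
```

Note that kml() stores its results back into the object you pass it, so traj_cld needs to be a plain variable rather than an expression.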

4. The connections window in RStudio

The connections window in the latest version of RStudio allows you to browse any remote databases without having to move into a separate environment like SQL developer. This convenience now offers the opportunity to fulfil dev projects entirely within the RStudio IDE.

By setting up your connection to a remote database in the connections window, you can browse inside nested schemas, tables, data types, and even view a table directly to see an extract of what the data looks like.

The connections window in the latest versions of RStudio

5. tidyr::complete()

The default behavior in R dataframes is that if no data exists for a particular observation, then the row for that observation does not appear in the dataframe. This can cause problems when you need to use this dataframe as an input for something which expects to see values for all possible observations.

Typically this problem occurs when you are sending the data into some graphing function that is expecting to see zero values when there are no observations, and can’t understand that a missing row means zero values in that row. This can also be an issue when you are making future projections and the starting point has missing rows.

The complete() function within tidyr allows you to fill in the gaps for all observations that had no data. It allows you to define the observations that you want to complete and then declare what value to use to plug the gaps. For example, if you were taking counts of male and female dogs of different breeds, and you had some combinations for which there were no dogs in the sample, you could use the following to deal with it:

dogdata %>%
  tidyr::complete(SEX, BREED, fill = list(COUNT = 0))

This will expand your dataframe to ensure that all possible combinations of SEX and BREED are included, and it will fill in missing values of COUNT with zeros.
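Here is a self-contained version of that call on a tiny, made-up dogdata dataframe, so you can see the missing row appear. The breeds and counts are hypothetical.

```r
library(magrittr)

# Hypothetical counts: there is no row at all for female poodles
dogdata <- data.frame(
  SEX   = c("M", "F", "M"),
  BREED = c("Labrador", "Labrador", "Poodle"),
  COUNT = c(3, 5, 2)
)

# Expand to every SEX x BREED combination, filling absent counts with 0
completed <- dogdata %>%
  tidyr::complete(SEX, BREED, fill = list(COUNT = 0))

completed
```

The result has four rows rather than three, with the female poodle row added and its COUNT set to zero.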

6. gganimate

Animated graphics are all the rage at the moment, and the gganimate package allows those who use ggplot2 (most R users I would say) to very simply extend their code to create animated graphics.

gganimate works by taking data that exists over a series of ‘transition states’, usually years or some other sort of time series data. You can plot the data within each transition state as if it were a simple static ggplot2 chart, then add a transition function such as transition_states() to move between the states and use the ease_aes() function to control how values ease from one state to the next. There are numerous options for how the transition occurs, and the animate() function allows the graphic to be rendered in a variety of forms such as an animated gif or an mpeg.

As an example, here’s a gif I created that shows all time points won by entrants in the Eurovision Song contest from 1957 to 2018:

Using gganimate to show all time Eurovision Song Contest results

For the code for this see here, and for a nice step-by-step tutorial on gganimate, which I found really helpful, see here.
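For a feel of how little extra code is involved, here is a minimal sketch. It is not the Eurovision code: the scores dataframe is invented, and the rendering step is commented out because it needs a renderer package such as gifski to be installed.

```r
library(ggplot2)
library(gganimate)

# Hypothetical points per country over three years
scores <- data.frame(
  year    = rep(2016:2018, each = 3),
  country = rep(c("A", "B", "C"), times = 3),
  points  = c(10, 20, 15, 25, 30, 18, 40, 35, 22)
)

# An ordinary static ggplot2 chart, plus two gganimate layers
anim <- ggplot(scores, aes(x = country, y = points, fill = country)) +
  geom_col() +
  # each year is one transition state
  transition_states(year, transition_length = 2, state_length = 1) +
  # how values ease from one state to the next
  ease_aes("cubic-in-out") +
  labs(title = "Year: {closest_state}")

# Rendering step (assumes a renderer such as gifski is available):
# animate(anim, nframes = 50, fps = 10)
```

Everything before transition_states() is plain ggplot2, which is what makes it so easy to animate a chart you have already built.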

7. networkD3

D3 is an extremely powerful data visualization library for JavaScript. An increasing number of packages that allow R users to build visualizations in D3 are becoming available, such as r2d3, which is great not least because it allows us to admire one of the best hex stickers ever (see here).

My favourite D3 package for R is networkD3. It has been around for a little while and is fantastic for plotting graph or network data in a responsive and aesthetically pleasing way. In particular, it can plot force-directed networks using forceNetwork(), Sankey diagrams using sankeyNetwork() and chord diagrams using chordNetwork(). Here’s an example of a simple Sankey network I created showing voting flows by region in the Brexit referendum.

Voting flows in the Brexit referendum using networkD3

More on this specific example here and more on networkD3 here.
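As a rough idea of the inputs sankeyNetwork() expects, here is a minimal sketch with made-up regions and flow values (not the actual referendum data). The main thing to know is that the source and target columns are zero-based indices into the nodes dataframe.

```r
# Hypothetical nodes: two regions flowing into two outcomes
nodes <- data.frame(name = c("Region A", "Region B", "Leave", "Remain"))

# Hypothetical flows; source and target are zero-based row indices of nodes
links <- data.frame(
  source = c(0, 0, 1, 1),
  target = c(2, 3, 2, 3),
  value  = c(60, 40, 45, 55)
)

sankey <- networkD3::sankeyNetwork(
  Links = links, Nodes = nodes,
  Source = "source", Target = "target",
  Value = "value", NodeID = "name",
  fontSize = 12, nodeWidth = 30
)

# Printing the object displays the interactive diagram in the viewer or browser
sankey
```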

8. Datatables in RMarkdown or Shiny using DT

The DT package is an interface from R to the DataTables javascript library. This allows very easy display of tables within a shiny app or R Markdown document that have a lot of in-built functionality and responsiveness. This prevents you from having to code separate data download functions, gives the user flexibility around the presentation and the ordering of the data and has a data search capability built-in.

For example, a simple command such as:

DT::datatable(
  head(iris),
  caption = 'Table 1: This is a simple caption for the table.'
)

can produce something as nice as this:

More on DT here, including how to set various options to customize the layout and add data download, copy and print buttons.
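As a sketch of the kind of customization mentioned above, the following adds copy, CSV download and print buttons using the Buttons extension. The particular option values (page length, which buttons to show) and the caption are just illustrative choices.

```r
iris_table <- DT::datatable(
  iris,
  extensions = "Buttons",
  options = list(
    dom = "Bfrtip",                         # B places the buttons above the table
    buttons = c("copy", "csv", "print"),
    pageLength = 5
  ),
  caption = "Table 2: The iris data with copy, download and print buttons."
)

# Printing the object renders the interactive table
iris_table
```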

9. Pimp your RMarkdown with prettydoc

prettydoc is a package by Yixuan Qiu which offers a simple set of themes to create a different, prettier look and feel to your RMarkdown documents. This is super helpful when you just want to jazz up your documents a little but don’t have time to get into the styling of them yourself.

It’s really easy to use. Simple edits to the YAML header of your document can invoke a specific style theme throughout the document, with numerous themes available. For example, this will invoke a lovely clean blue coloring and style across titles, tables, embedded code and graphics:

---
title: "My doc"
author: "Me"
date: June 3, 2019
output:
  prettydoc::html_pretty:
    theme: architect
    highlight: github
---

10. Optionally hide your code in RMarkdown with code_folding

RMarkdown is a great way to record your work, allowing you to write a narrative and capture your code all in one place. But sometimes your code can be overwhelming and not particularly pleasant for non-coders who are trying to read just the narrative of your work and are not interested in the intricacies of how you conducted the analysis.

Previously the only options we had were to set either echo = TRUE or echo = FALSE in our knitr options to show our code in the document or not. But now we can set an option in the YAML header that gives us the best of both worlds. Setting code_folding: hide under the html_document output in the YAML header will hide the code chunks by default, but provide little click-down boxes in the document so that the reader can view all the code, or particular chunks, as and when they want to, like this:

Code folding drop downs in R Markdown
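For reference, a minimal YAML header with this option might look like the following. The title is a placeholder, and the option sits under the html_document output format:

```yaml
---
title: "My doc"
output:
  html_document:
    code_folding: hide
---
```

Using code_folding: show instead displays the code by default while still letting the reader collapse it.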

So that wraps up my next ten random R tips. I hope some of these were helpful, and please feel free to add any of your own tips to the comments for other users to read.

Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter.

Bio: Keith McNulty is a Data Scientist at McKinsey & Company.

Original. Reposted with permission.
