2018’s Top 7 R Packages for Data Science and AI
This is a list of the best packages that changed our lives this year, compiled from my weekly digests.
Editor's note: This post covers Favio's selections for the top 7 R packages of 2018. Yesterday's post covered his top 7 Python libraries of the year.
If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R, where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.
The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).
If you want to read the past digests, take a look here:
Disclaimer: This list is based on the libraries and packages I reviewed in my personal newsletter. All of them were trending in one way or another among programmers, data scientists, and AI enthusiasts. Some of them were created before 2018, but if they were trending, they could be considered.
Top 7 for R
7. infer — An R package for tidyverse-friendly statistical inference
Inference, or statistical inference, is the process of using data analysis to deduce properties of an underlying probability distribution.
The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the
tidyverse design framework.
To install the current stable version of
infer from CRAN:
Let’s try a simple example on the
mtcars dataset to see what the library can do for us.
First, let’s overwrite
mtcars so that the variables
We’ll try hypothesis testing. Here, a hypothesis is proposed so that it’s testable on the basis of observing a process that’s modeled via a set of random variables. Normally, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model.
Here, we first specify the response and explanatory variables, then we declare a null hypothesis. After that, we generate resamples using bootstrap and finally calculate the median. The result of that code is:
One of the greatest parts of this library is the
visualize function. This will allow you to visualize the distribution of the simulation-based inferential statistics or the theoretical distribution (or both). For an example, let’s use the flights data set. First, let’s do some data preparation:
And now we can run a randomization approach to
or see the theoretical distribution:
For more on this package visit:
6. janitor — simple tools for data cleaning in R
Data cleansing is a topic very close to me. I’ve been working with my team at Iron-AI to create a tool for Python called Optimus. You can see more about it here:
Data cleansing and exploration with Python and Apache Spark — Big Data and Data Science — Optimus
The group of BBVA Data & Analytics in Mexico has been using Optimus for the past months and we have boosted our…hioptimus.com
But this tool I’m showing you is a very cool package with simple functions for data cleaning.
It has three main functions:
- perfectly format
- create and format frequency tables of one, two, or three variables (think an improved
- isolate partially-duplicate records.
I’m using the example from the repo, and the data
clean_names() function, we’re telling R that we’re about to use janitor. Then we clean the empty rows and columns, and then using dplyr we change the format for the dates, create a new column with the information of
certification_1, and then drop them.
And with this piece of code…
we can find duplicated records that have the same name and last name.
The package also introduces the tabyl function that tabulates the data, like table but pipe-able, data.frame-based, and fully featured. For example:
You can do a lot more things with the package, so visit their site and give them some love :)
5. Esquisse — RStudio add-in to make plots with ggplot2
This add-in allows you to interactively explore your data by visualizing it with the ggplot2 package. It allows you to draw bar graphs, curves, scatter plots, and histograms, and then export the graph or retrieve the code generating the graph.
Install from CRAN with :
Then launch the add-in via the RStudio menu. If you don’t have
data.framein your environment, datasets in
ggplot2 are used.
ggplot2 builder addin
Launch the add-in via the RStudio menu or with:
The first step is to choose a
Or you can use a dataset directly with:
After that, you can drag and drop variables to create a plot:
You can find information about the package and sub-menus in the original repo: