Next Generation Data Manipulation with R and dplyr
The idea behind the dplyr package is to do one thing at a time. dplyr has separate functions for every task which make its implementation crisp and easy to understand.
By Chaitanya Sagar, Perceptive Analytics.
They say gone are the days of slow, outdated technologies, and that one should adopt new methods. Well, the dplyr developers took this seriously and built a package for fast and robust data manipulation. dplyr is also known as the successor of the plyr package, only better, faster and stronger. The proof? A lot of the package's code is written in Rcpp, which is the key to its speed. The package also supports databases such as MySQL, SQLite, PostgreSQL and Google BigQuery. This article looks into some of the functions in the package and develops a familiarity with them.
The idea behind the dplyr package is to do one thing at a time. Are you a programmer who likes to cram many operations into a single complicated line? Then the package is not for you. dplyr has a separate function for every task, which makes its code crisp and easy to understand.
R’s dplyr In Action
Let’s try out the different dplyr functions with the help of the airquality dataset, which is built into R. It records daily air quality measurements for New York from May 1973 to September 1973, giving us a total of 153 rows to work with.
Viewing Subsets of Data - filter() Function
The filter() function is used to view the rows that satisfy a condition you specify. This is useful when you want to inspect slices of the data without the hassle of opening it in an Excel sheet and applying filters there. As with classic filtering, you can apply multiple conditions and combine them with the & (AND) and | (OR) logical operators. Since this is air quality data, let’s see what we have first.
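A minimal sketch of that first look (airquality ships with base R; dplyr is assumed installed):

```r
library(dplyr)

# First look at the built-in air quality data (New York, May-Sep 1973)
str(airquality)
summary(airquality)  # the NA counts for Ozone and Solar.R show up here
```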
There are 37 NAs in Ozone and 7 in Solar.R. How about looking only at the month of May? This is where filter() comes in.
</gr-replace>
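A sketch of the filtering step (Month is coded numerically in airquality, so May is 5):

```r
library(dplyr)

# Keep only the rows for May (Month == 5)
may <- filter(airquality, Month == 5)
may
nrow(may)  # 31 days
```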
So four out of the seven missing values for Solar.R occurred in this month. Among the 31 days in May, there are also some days when the Ozone value is quite low. Let’s see on how many days the value is below the 1st quartile.
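A sketch of that check, computing the 1st quartile of Ozone (18, per summary(), ignoring the NAs) rather than hard-coding it:

```r
library(dplyr)

# 1st quartile of Ozone, ignoring the NAs
q1 <- quantile(airquality$Ozone, 0.25, na.rm = TRUE)

# Days in May with Ozone below the 1st quartile
filter(airquality, Month == 5, Ozone < q1)
```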
Ozone values fall below the 1st quartile on 12 days in May. Is that unusually many? Let’s see all the days when Ozone was this low in the other months.
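The same filter, negated on the month (a sketch; 18 is the 1st quartile of Ozone from above):

```r
library(dplyr)

# Low-Ozone days (below the 1st quartile of 18) outside May
filter(airquality, Month != 5, Ozone < 18)
```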
So we have 16 such days spread across the remaining four months, but 12 days in May alone. Let’s see how the Solar.R values change with these low Ozone levels. We’ll go back to the month of May for now.
Wait! This requires two operations at the same time! We need to first filter the rows with low Ozone levels in the month of May and then arrange the values in order of increasing Ozone. The arrange() function can sort the values, and we will use a special piping functionality to combine filter() and arrange().
Sorting Values - arrange() Function and The Pipe Operator
When the dplyr library is loaded, a few more packages are imported as well. One of them is magrittr, which contains the pipe operator. The pipe operator %>% is simply used to run a chain of operations. When multiple operations are combined with this operator, the functions are executed from left to right, just as in a pipe. Let’s filter the low Ozone levels for the month of May and arrange the Ozone levels so that we can read the Solar.R trend with increasing Ozone. The arrange() function does this task. Since we are piping, we can write the dataset first and then pipe both of our functions onto it.
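A sketch of the piped chain described above:

```r
library(dplyr)  # loading dplyr also makes %>% available

# Low-Ozone days in May, sorted by increasing Ozone
airquality %>%
  filter(Month == 5, Ozone < 18) %>%
  arrange(Ozone)
```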
At least the Solar.R values do not increase linearly with Ozone levels. Let’s look at Wind and Temp next to see if there is any relation between them. If we are only working with Wind and Temp, we don’t need to see the other variables. The select() function can pick out the desired features for us.
Increasing Your Focus - select() Function
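A sketch of the selection step (column names as they appear in airquality):

```r
library(dplyr)

# Keep only the Wind and Temp columns
airquality %>%
  select(Wind, Temp)
```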
Creating New Data - mutate() Function
From the previous output, it can be noticed that there is no direct relation between Wind and Temp. However, the Temp values tend to be lower as wind increases. Why don’t we use the mutate() function to convert Temp to Celsius and look again? Celsius temperatures have a shorter range, which will make it easier to pinpoint whether the trend exists.
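A sketch of the conversion; the column name Temp_C is our own choice, and the formula is the standard C = (F − 32) × 5/9:

```r
library(dplyr)

# Add a Celsius temperature column alongside Wind and Temp
airquality %>%
  select(Wind, Temp) %>%
  mutate(Temp_C = (Temp - 32) * 5 / 9)
```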
It is very easy and convenient to keep adding functions with the pipe operator and end up with easy-to-interpret code and clean output.
Let’s go into the more advanced functions from the package: summarize() and group_by().
Divide and Rule - summarize() and group_by() Functions
We already have the base summary() function to give us basic statistics about each feature in our data: mean, median, min, max and quartiles, along with the number of NA values. But the summarize() and group_by() functions from dplyr can do much more. Let’s calculate the mean temperature for each month. With dplyr, one simply applies group_by() over Month and then summarize() to find the mean. With so many functions in a single package, your code becomes more efficient and easier to understand, and the mundane tasks of data cleaning and exploration become much less painful.
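A sketch of the grouped summary (the column name mean_temp is our own choice):

```r
library(dplyr)

# Mean temperature per month
airquality %>%
  group_by(Month) %>%
  summarize(mean_temp = mean(Temp))
```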
Judging by the temperatures, the months of July and August are really hot. group_by() can also be combined with other verbs to produce results. For instance, we can add the mean_temp values as a new column on our data.
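A sketch of group_by() combined with mutate(), which attaches each month's mean to every row of that month instead of collapsing the data:

```r
library(dplyr)

# Attach each month's mean temperature as a new column on the full data
airquality %>%
  group_by(Month) %>%
  mutate(mean_temp = mean(Temp)) %>%
  ungroup()
```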
There are a host of other functions one can use to summarize:
- n() - count the number of values (rows) in the current group
- n_distinct() - number of distinct values in the vector
- first() - return the first value in the vector
- last() - return the last value in the vector
- sd() - return the standard deviation
- min() - find the minimum
- max() - find the maximum
- median() - find the median
- sum() - find the sum
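Several of these helpers can be combined in a single summarize() call; a sketch, again grouped per month:

```r
library(dplyr)

# A few summary helpers at once, computed per month
airquality %>%
  group_by(Month) %>%
  summarize(
    days     = n(),
    sd_temp  = sd(Temp),
    max_wind = max(Wind)
  )
```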
This is Not the End
The dplyr package is not limited to filter(), mutate(), arrange(), summarize() and group_by(); it has a host of other functions. One can use joins (inner_join, left_join, right_join, full_join, semi_join and anti_join), if_else (which preserves data types), or even extract samples (using sample_n() or sample_frac()) and count rows. These functions also tend to run much faster than their base-R alternatives. dplyr makes data manipulation easy, consistent and fast.
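Two of those extras in a quick sketch, sampling and counting:

```r
library(dplyr)

# A random sample of 5 rows, and a count of rows per month
sample_n(airquality, 5)
count(airquality, Month)
```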
Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.