KDnuggets Home » News » 2017 » Aug » Tutorials, Overviews » Next Generation Data Manipulation with R and dplyr ( 17:n34 )

Next Generation Data Manipulation with R and dplyr


The idea behind the dplyr package is to do one thing at a time. dplyr has separate functions for every task which make its implementation crisp and easy to understand.



By Chaitanya Sagar, Perceptive Analytics.

dplyrThey say gone are the days of slow and old technologies and one should adopt new methods. Well, the developers at R took this seriously and made dplyr package as one of their offerings for a package with fast and robust computing. dplyr is also known as the successor of the plyr package, only better, faster and stronger. The proof? A lot of code for the package is written in Rcpp which is the key to speed. The package also supports databases such as MySQL, SQLite, PostgreSQL and google bigquery. This article looks into some of the functions in the package and develop a familiarity with it.

The idea behind the dplyr package is to do one thing at a time. Are you a programmer who likes to write short code with complicated lines? Then the package is not for you. dplyr has separate functions for every task which make its implementation crisp and easy to understand.

 

R’s dplyr In Action

 
Let’s try using different dplyr functions with the help of the air quality dataset which is built in R. The air quality dataset displays measurements of air quality in New York for the period May’73 to Sep’73. We have a total of 153 rows to work on. This dataset is available in R.

Viewing Subsets of Data - filter() Function

The filter function is used to view data which satisfies a certain condition you specify. This is useful in situations when you want to see speckles of data without the hassle of opening it up in excel sheets and applying filters. Similar to classic filtering functionality, one can apply multiple filters and combine them using & (for AND) and | (for OR) logical operands. This is air-quality data so let’s see what we have first.

> attach(airquality)
> summary(airquality)

     Ozone           Solar.R           Wind             Temp           Month            Day      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0  
 NA's   :37       NA's   :7


There are 37 NA’s in Ozone and 7 in Solar.R. How about seeing the values of the month of May? We use the filter function now.

> library(dplyr)
#Use the filter function to get all observations for the month of May
> filter(airquality,Month==5)

   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15
16    14     334 11.5   64     5  16
17    34     307 12.0   66     5  17
18     6      78 18.4   57     5  18
19    30     322 11.5   68     5  19
20    11      44  9.7   62     5  20
21     1       8  9.7   59     5  21
22    11     320 16.6   73     5  22
23     4      25  9.7   61     5  23
24    32      92 12.0   61     5  24
25    NA      66 16.6   57     5  25
26    NA     266 14.9   58     5  26
27    NA      NA  8.0   57     5  27
28    23      13 12.0   67     5  28
29    45     252 14.9   81     5  29
30   115     223  5.7   79     5  30
31    37     279  7.4   76     5  31


So four out of seven missing values for Solar.R occurred in this month. Among the 31 values, three are some days when the ozone value is too low. Let’s see how many days the value is less than the 1st quartile.

#Use the filter function to get all observations when Ozone is less than the 1st quartile for the month of May
> filter(airquality,Month==5 & Ozone<18)

   Ozone Solar.R Wind Temp Month Day
1     12     149 12.6   74     5   3
2      8      19 20.1   61     5   9
3      7      NA  6.9   74     5  11
4     16     256  9.7   69     5  12
5     11     290  9.2   66     5  13
6     14     274 10.9   68     5  14
7     14     334 11.5   64     5  16
8      6      78 18.4   57     5  18
9     11      44  9.7   62     5  20
10     1       8  9.7   59     5  21
11    11     320 16.6   73     5  22
12     4      25  9.7   61     5  23


Ozone values are quite low for 12 days in May. Is it too high? Let’s see all the days when Ozone was this low in other months.

#Use the filter function to get all observations when Ozone is less than the 1st quartile except for the month of May
> filter(airquality,Month!=5 & Ozone<18)

1     12     120 11.5   73     6  19
2     13     137 10.3   76     6  20
3     10     264 14.3   73     7  12
4      7      48 14.3   80     7  15
5     16       7  6.9   74     7  21
6      9      24 13.8   81     8   2
7     16      77  7.4   82     8   3
8      9      36 14.3   72     8  22
9      9      24 10.9   71     9  14
10    13     112 11.5   71     9  15
11    13      27 10.3   76     9  18
12    16     201  8.0   82     9  20
13    13     238 12.6   64     9  21
14     7      49 10.3   69     9  24
15    14      20 16.6   63     9  25
16    14     191 14.3   75     9  28


So we have 16 days in the remaining months with low Ozone levels but 12 days in Ozone month itself. Let’s see how the Solar.R values change with changing low levels of Ozone. We’ll go back to the month of May for now.

Wait! This requires two operations at the same time! We need to first filter rows with low ozone levels in the month of may and then arrange the values in the order of increasing Ozone levels. The arrange() function can sort the values and we will use a special functionality of piping to combine filter() and arrange.

Sorting Values - arrange() Function and The Pipe Operator

When the dplyr library is loaded, a few more packages are imported as well. One of them is magrittr which contains the pipe operator. The pipe operator %>% is simply used to run a chain of operations. When I use multiple operations and combine them using this operator, the functions are executed from left to right just as in a pipe. Let’s filter the low Ozone levels for the month of May and arrange the Ozone levels so that we can read the Solar. R trend with increasing Ozone. The arrange() function does this task. Since we are piping, we can write the dataset first and then pipe both of our functions to it.

#using the arrange function piped with filter function
> airquality %>% 
      filter(Month!=5 & Ozone<18) %>% 
      arrange(Ozone)

   Ozone Solar.R Wind Temp Month Day
1      7      48 14.3   80     7  15
2      7      49 10.3   69     9  24
3      9      24 13.8   81     8   2
4      9      36 14.3   72     8  22
5      9      24 10.9   71     9  14
6     10     264 14.3   73     7  12
7     12     120 11.5   73     6  19
8     13     137 10.3   76     6  20
9     13     112 11.5   71     9  15
10    13      27 10.3   76     9  18
11    13     238 12.6   64     9  21
12    14      20 16.6   63     9  25
13    14     191 14.3   75     9  28
14    16       7  6.9   74     7  21
15    16      77  7.4   82     8   3
16    16     201  8.0   82     9  20


At least the Solar.R values do not linearly increase with Ozone levels. Let’s see the wind and temp if there is any relation between them. If we are just working with Wind and Temp, we don’t need to see the other variables. The select function can do the task of selecting the desired features for us.

Increasing Your Focus - select() Function

#using the select function piped with arrange function to see if there is any relation with wind and temp for the first ten days of May
> airquality %>% 
       filter(Month==5 & Day<=10) %>%
       select(Wind,Temp) %>% 
       arrange(Wind)

   Wind Temp
1   7.4   67
2   8.0   72
3   8.6   65
4   8.6   69
5  11.5   62
6  12.6   74
7  13.8   59
8  14.3   56
9  14.9   66
10 20.1   61


Creating New Data - mutate() Function

From the previous output, it can be noticed that there is no direct relation between Wind and Temperature. However, the Temp values tend to be lower for increasing wind. Why don’t we use the mutate() function to change the Temp function to Celsius and observe the same. Celsius temperature has a shorter range and will make it easier to pinpoint the existence of the trend.

#Using the filter(), mutate(), select() and arrange() functions to get Wind and TempC where TempC is the temperature in Celsius
> airquality %>% 
       filter(Month==5 & Day<=10) %>%
       mutate(TempC = (Temp - 32) * 5 / 9) %>%
       select(Wind,TempC) %>% 
       arrange(Wind)

   Wind    TempC
1   7.4 19.44444
2   8.0 22.22222
3   8.6 18.33333
4   8.6 20.55556
5  11.5 16.66667
6  12.6 23.33333
7  13.8 15.00000
8  14.3 13.33333
9  14.9 18.88889
10 20.1 16.11111


It is very easy and convenient to keep on adding functions using pipe operator and have an easy to interpret code with clean output.

Let’s go into the advanced functions from the package - summarize and group by.

Divide and Rule - summarize() and group_by() Functions

We already have the summary function to give us the basic statistics about each feature in our data. The inbuilt R function gives us mean, median, min, max and quartiles of each feature along with the number of NA values but the summarize() and group_by() functions from dplyr can do much more. Let’s calculate the mean temperature for each month. With dplyr, one has to simply apply a group_by() over month and then summarize() to find the mean. With so many functions in a single package, all your code can be made more efficient and easy to understand. Use the package meticulously and the mundane tasks of data cleaning and exploration will be much better than ever before.

#Find the mean temp for each month
> airquality	%>%
       group_by(Month) %>%
       summarize(mean_temp=mean(Temp))

# A tibble: 5 x 2
  Month mean_temp
       
1     5  65.54839
2     6  79.10000
3     7  83.90323
4     8  83.96774
5     9  76.90000


The months of July and August are really hot looking at the temperature. group_by() can also be used with other functions to produce results. For instance, we can also add the mean_temp features to our data.

#Create a new dataset with mean_temp from each month
> airquality2 = airquality %>%
                group_by(Month)	%>%
                mutate(mean_temp=mean(Temp))

head(airquality2)

# A tibble: 6 x 7
# Groups:   Month [1]
  Ozone Solar.R  Wind  Temp Month   Day mean_temp
              
1    41     190   7.4    67     5     1  65.54839
2    36     118   8.0    72     5     2  65.54839
3    12     149  12.6    74     5     3  65.54839
4    18     313  11.5    62     5     4  65.54839
5    NA      NA  14.3    56     5     5  65.54839
6    28      NA  14.9    66     5     6  65.54839


There are a host of other functions one can use to summarize:

  • n() - find the length of the vector
  • n_distinct() - number of distinct values in the vector
  • first() - return the first value in the vector
  • last() - return the last value in the vector
  • sd() - return the standard deviation
  • min() - find the minimum
  • max() - find the maximum
  • median() - find the median
  • sum() - find the sum

This is Not the End

The dplyr function is not limited to filter(), mutate(), arrange(), summarize() and group_by() but has a host of functions. One can use joins(inner_join, left_join, right_join, full_join, semi_join and anti_join) or if_else which preserves the data types or even extract samples (using sample_n() or sample_frac() and count rows. All of these functions take seconds while their alternatives do the same in minutes or hours. dplyr makes data manipulation easy, consistent and fast.

 
Bio: Chaitanya Sagar is the Founder and CEO of Perceptive Analytics. Perceptive Analytics has been chosen as one of the top 10 analytics companies to watch out for by Analytics India Magazine. It works on Marketing Analytics for e-commerce, Retail and Pharma companies.

Related: