How to tackle common data cleaning issues in R
R is a great choice for manipulating, cleaning, summarizing, producing probability statistics, and so on. In addition, it's not going away anytime soon, it is platform independent, so what you create will run almost anywhere, and it has awesome help resources.
This tutorial is an excerpt from the book, Statistics for Data Science written by James D. Miller and published by Packt Publishing.
R is a language and environment that is easy to learn, very flexible in nature, and very focused on statistical computing, making it a great choice for manipulating, cleaning, summarizing, producing probability statistics, and so on.
In addition, here are a few more reasons to use R for data cleaning:
- It is used by a large number of data scientists so it's not going away anytime soon
- R is platform independent, so what you create will run almost anywhere
- R has awesome help resources--just Google it, you'll see!
Editor’s Note: While the author has named the example data as ‘Gamming Data’, it is simply the gaming data that he uses to demonstrate his code.
The simplest explanation for what outliers are might be is to say that outliers are those data points that just don't fit the rest of your data. Upon observance, any data that is either very high, very low, or just unusual (within the context of your project), is an outlier. As part of data cleansing, a data scientist would typically identify the outliers and then address the outliers using a generally accepted method:
- Delete the outlier values or even the actual variable where the outliers exist
- Transform the values or the variable itself
Let's look at a real-world example of using R to identify and then address data outliers.
In the world of gaming, slot machines (a gambling machine operated by inserting coins into a slot and pulling a handle which determines the payoff) are quite popular. Most slot machines today are electronic and therefore are programmed in such a way that all their activities are continuously tracked. In our example, investors in a casino want to use this data (as well as various supplementary data) to drive adjustments to their profitability strategy. In other words, what makes for a profitable slot machine? Is it the machine's theme or its type? Are newer machines more profitable than older or retro machines? What about the physical location of the machine? Are lower denomination machines more profitable? We try to find our answers using the outliers.
We are given a collection or pool of gaming data (formatted as a comma-delimited or CSV text file), which includes data points such as the location of the slot machine, its denomination, month, day, year, machine type, age of the machine, promotions, coupons, weather, and coin-in (which is the total amount inserted into the machine less pay-outs). The first step for us as a data scientist is to review (sometimes called profile) the data, where we'll determine if any outliers exist. The second step will be to address those outliers.
Step 1 – Profiling the data
R makes this step very simple. Although there are many ways to program a solution, let us try to keep the lines of the actual program code or script to a minimum. We can begin by defining our CSV file as a variable in our R session (named
MyFile) and then reading our file into an R
In statistics, a
boxplot is a simple way to gain information regarding the shape, variability, and centre (or median) of a statistical dataset, so we'll use the
boxplot with our data to see if we can identify both the median
Coin-in and if there are any outliers. To do this, we can ask R to plot the
Coin-in value for each slot machine in our file, using the
Coin-in is the 11th column in our file so I am referring to it explicitly as a parameter of the function
boxplot. I've also added optional parameters (again, continuing the effort to stay minimal) which add headings to the visualization.
Executing the previous script yields us the following visual. Note both the median (shown by the line that cuts through the box in the
boxplot) as well as the four outliers:
Step 2 – Addressing the outliers
Now that we see the outliers do exist within our data, we can address them so that they do not adversely affect our intended study. Firstly, we know that it is illogical to have a negative
Coin-in value since machines cannot dispense more coins that have been inserted in them. Given this rule, we can simply drop any records from the file that have negative
Coin-in values. Again, R makes it easy as we'll use the
subset function to create a new version of our data.frame, one that only has records (or cases) with non-negative
We'll call our
subset data frame
Then, we'll replot to make sure we've dropped our negative outlier:
This produces a new
boxplot, as shown in the following screenshot:
We can use the same approach to drop our extreme positive
Coin-in values (those greater than $1,500) by creating yet another
subset and then replotting:
It is well-advised, as you work through various iterations of your data, that you save off most (if not just the most significant) versions of your data. You can use the R function
Most data scientists adopt a common naming convention to be used through the project (if not for all the projects). The names of your files should be as explicit as possible to save you time later. In addition, especially when working with big data, you need to be mindful of disk space.
The output of the preceding code is as follows:
Moving on, another data cleaning technique is referred to as cleaning data based upon domain expertise. This doesn't need to be complicated. The point of this technique is simply using information not found in the data. For example, previously we excluded cases with negative
Coin-in values since we know it is impossible to have a negative
Coin-in amount. Another example might be the time when Hurricane Sandy hit the northeast United States. During that period of time, the cases of most machines had very low (if not zero)
Coin-in amounts. A data scientist would probably remove all the data cases during a specific time period, based on that information.
Cross-validation is is a technique where a data scientist applies rules to data in a data pool.
Validity checking is the most common form of statistical data cleansing and is a process that both the data developer and the data scientist will most likely be (at least somewhat) familiar with.
There can be any number of validity rules used to clean the data, and these rules will depend upon the intended purpose or objective of the data scientist. Examples of these rules include: data-typing (for example, a field must be a numeric), range limitations (where numbers or dates must fall within a certain range), required (a value cannot be empty or missing), uniqueness (a field, or a combination of fields, must be unique within the data pool), set-member (this is when values must be a member of a discreet list), foreign-key (certain values found within a case must be defined as member of or meeting a particular rule), regular expression patterning (which simply means verifying that a value is formatted in a prescribed format), and cross-field validation (where combinations of fields within a case must meet a certain criteria).
Let's look at a few examples of the preceding, starting with data-typing (also known as coercion). R offers six coercion functions to make it easy:
These functions, along with a little R knowledge, can make the effort of converting a value in a data pool pretty straightforward. For example, using the previous GammingData as an example, we might discover that a new gamming results file was generated and the age value was saved as a string (or text value). To clean it, we need to convert the value to a numeric data type. We can use the following single line of R code to quickly convert those values in the file:
One point: using this simple approach, should any value be unable to be converted, it will be set to an NA value. In type conversion, the real work is understanding what type a value needs to be, and, of course, what data types are valid; R has a wide variety of data types, including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.
Another area of data cleaning we'll look at here is the process of regular expression patterning. In practice, especially when working with data that is collected (or mined) from multiple sources, the data scientist surely encounters either field that are not in the desired format (for the objective at hand) or, field values that are inconsistently formatted (which potentially can yield incorrect results). Some examples can be dates, social security numbers, and telephone numbers. With dates, depending on the source, you may have to re-type (as described previously), but more often than not, you'll also need to reformat the values into a format that is usable, given your objective.
Re-typing a date is important so that R knows to use the value as an actual date and you can use the various R data functions correctly.
A common example is when data contains cases with dates that are perhaps formatted as YYYY/MM/DD and you want to perform a time series analysis showing a sum week to week, or some other operation that requires using the date value but perhaps requiring the date to be reformatted, or you just need it to be a true R date object type. So, let's assume a new Gamming file—this one with just two columns of data: Date and Coinin. This file is a dump of collected Coinin values for a single slot machine, day by day.
The records (or cases) in our new file look like the following screenshot:
A variety of cleaning examples can be used by the data scientist. Starting with verifying what data types each of these data points are. We can use the R function class to verify our file's data types. First (as we did in the previous example), we read our CSV file into an R data frame object:
Next, we can use the
class function, as shown in the following screenshot:
You can see in the preceding screenshot that we used
class to display our data types.
MyData is our data frame holding our gaming data,
Date is of type
Coinin is an
integer. So, the data frame and the integer should make sense to you, but take note that R sets our dates up for what it calls a
factor. Factors are categorical variables that are beneficial in summary statistics, plots, and regressions, but not so much as date values. To remedy this, we can use the R functions
paste as shown next:
This reformats the value of our
Data field in all our cases in one simple line of script by pulling apart the field into three segments (the month, day, and year) and then pasting the segments back together in the order we want (with a/as the separator (
sep)), as shown in the following screenshot:
We find that this line of script converts our
Data field to type
character and, finally, we can use them
as.Date function to re-data type our values to an R
With a little trial and error, you can reformat a string or character data point exactly how you want it.
Data cleaning through enhancement is another common technique where data is made complete (and perhaps more valuable) by adding related information, facts, and/or figures. The source of this additional data could be calculations using information already in the data or added from another source. There are a variety of reasons why a data scientist may take the time to enhance data.
Based upon the purpose or objective at hand, the information the data scientist adds might be used for reference, comparison, contrast, or show tendencies. Typical use cases include:
- Derived fact calculation
- Indicating the use of calendar versus fiscal year
- Converting time zones
- Currency conversions
- Adding current versus previous period indicators
- Calculating values such as the total units shipped per day
- Maintaining slowly changing dimensions
As a data scientist, you should always use scripting to enhance your data, as this approach is much better than editing a data file directly since it is less prone to errors and maintains the integrity of the original file. Also, creating scripts allows you to reapply the enhancements to multiple files and/or new versions of files received, without having to redo the same work.
For a working example, let us again go back to our
GammingData. Assume we're receiving files of the
Coinin amounts by slot machine and our gaming company now runs casinos outside of the continental United States. These locations are sending us files to be included in our statistical analysis and we've now discovered that these international files are providing the
Coinin amounts in their local currencies. To be able to correctly model the data, we'll need to convert those amounts to US dollars. Here is the scenario:
File Source: Great Britain
Currency used: GBP or Great British Pound
The formula to convert our GBP values to USD is simply the amount multiplied by an exchange rate. So, in R:
The previous line of code will accomplish what we want; however, the data scientist is left to determine which currency needs to be converted (GBP) and what the exchange rate to be used is. Not a huge deal, but one might want to experiment with creating a user-defined function that determines the rate to be used, as shown next:
Although the preceding code snippet is rather simplistic, it illustrates the point of creating logic that we can reuse later:
Finally, to make things better still, save off your function (in an R file) so that it can always be used:
Of course, in the best of all worlds, we might modify the function to look up the rate in a table or a file, based upon the country code, so that the rates can be changed to reflect the most current value and to de-couple the data from the program code.
With data harmonization, the data scientist converts, translates, or maps data values to other more desirable values, based upon the overall objective or purpose of the analysis to be performed. The most common examples of this can be gender or country code. For example, if your file has gender coded as
F, you might want to convert the data points to be consistently coded as
With country codes, the data scientist may want to plot summations for regions: North America, South America, and Europe rather than USA, CA, MX, BR, CH, GB, FR, and DE individually. In this case, he or she would be creating aggregated values:
North America = USA + CA + MX
South America = BR + CH
Europe = GB + FR + DE
To make a point, perhaps the data scientist has stitched together multiple survey files, all containing gender, called
gender.txt, but in various codes (
Female). If we tried to use the R function table, we would see the following undesirable result:
And if we visualize this with the best of expectations:
We see the following screenshot:
Once again, to solve the inconsistent coding of the data point gender, I've borrowed the concept from the example in the previous section and created a simple function to help us with our recoding:
This time, I've added the
toupper function so that we don't have to worry about the case, as well as
substr to handle values that are longer than a single character.
I am assuming the argument value will be either
Female, otherwise an error will occur.
Since R categorizes the
Gender value as data type
factor, I found it was difficult to easily make use of the simple function, so I decided to create a new R data frame object to hold our harmonized data. I've also decided to use a looping process to read through each case (record) in our file and convert it to
Now we can enjoy a more appropriate visualization by writing:
The output of the preceding code is as follows:
Most mainstream data scientists have noted the importance of standardizing data variables (changing reference data to a standard) as part of the data cleaning process before beginning a statistical study or analysis project. This is important, as, without standardization, the data points measured using different scales will most likely not contribute equally to the analysis.
If you consider that a data point within a range between 0 and 100 will outweigh a variable within a range between 0 and 1, you can understand the importance of data standardization. Using these variables without standardization in effect gives the variable with the larger range a larger weight in the analysis. To address this concern and equalize the variables, the data scientists try to transform the data into a comparable scale.
Centring (of the data points) is the most common example of data standardization (there are many others though). To center a data point, the data scientist would subtract the mean of all the data points from each individual data point in the file.
Instead of doing the mathematics, R provides the
scale function. This is a function whose default method centers and/or scales a column of numeric data in a file in one line of code. Let's look at a simple example.
Back to our slot machines! In our gaming files, you may recall that there is a field named
Coinin that contains a numeric value indicating the total dollars put into the machine. This is considered a measurement of the machine profitability. This seems like an obvious data point to use in our profitability analysis. However, these amounts may be misleading since there are machines of different denominations (in other words, some machines accept nickels while others accept dimes or dollars). Perhaps this difference in machine denominations creates an unequal scale. We can use the
scale function to address this situation. First, we see in the following screenshot, the values of
We can then write the following line of code to center our
Coin.in data points:
The value of center determines how column centring is to be performed. Using center is
TRUE causes centring to be done by subtracting the column means (omitting NAs) of
Coin.in from their corresponding columns. The value of
scale determines how column scaling is performed (after centring). If the scale is
TRUE, then scaling is done by dividing the (centered) columns of
Coin.in by their standard deviations if a center is
TRUE, and the root mean square otherwise.
We see the difference in the following screenshot:
You enjoyed an excerpt from the book, Statistics for Data Science written by James D. Miller and published by Packt Publishing.
Use the code ORKDNA09 at checkout to get recommended eBook retail price for $9 only until June 30, 2018.