Top R Packages for Data Cleaning
Data cleaning is one of the most important and time consuming task for data scientists. Here are the top R packages for data cleaning.
By Anna Kayfitz, CEO of StrategicDB Corp
As millions or billions of data elements come into your business each day, it is almost inevitable that some of it will lack the necessary quality to create efficient business models. Ensuring that your data is clean should always be the first and arguably most important part of a Data Science workflow as without it, you will have difficulty seeing what is important and potentially make the wrong decisions due to duplicates, anomalies or missing information.
One of the most common and powerful data programming tools is R, an open source language and environment for statistical computing and graphics. R provides uses with all the tools needed to create data science projects but with anything, it is only as good as the data that feeds into it. With that, there are a number of libraries within the R environment that help with data cleaning and manipulation before the start of any project.
Exploring the data
Most of the tools for exploring a set of data that you’ve imported already exist within the R platform.
This handy command simply gives an overview of all your data attributes, displaying the min, max, median, mean and category splits for each. It is a good method for quickly spotting any potential data anomalies.
Following on from this, you can use a histogram to better understand the distribution of your data. This will visualise show any outliers within the dataset or any numeric column that you are particularly looking to observe.
The plyr package
You will need to install the plyr package to create your Histogram, using the standard R functionality for installing libraries
This will create a visualisation of your data to spot any anomalies quickly for. A boxplot visualisation uses the same package but splits into quartiles for outlier detection. Both of these combined will quickly tell you if you need to limit the dataset or only use certain segments of it within any algorithms or statistical modelling.
R has a number of pre-built methods for correcting data errors such as converting values as you might do in Excel or SQL with simple logic e.g. as.charater() converts the column to a character string.
However, if you want to start correcting the errors that you saw in your histogram or boxplot, there are additional packages that have the capability of doing just that.
The stringr package
There are a few different ways in which stringr can help cleanse your data including trimming white spaces and replacing certain unnecessary words. These are quite standard bits of code structured as str_trim(YOUR_DATA_FIELD) which simply removes the white space.
However, what about removing the anomalies that our histogram told us we had? It would need a bit more complexity than this but as a basic example, we can tell R to replace all the outliers in our field with the median value of that field. This will move everything in together and take away anomaly bias.
It is very simple in R to check for incomplete data and perform and action with that field. For example, this function will eliminate missing vales completely from your chosen data column.
There are similar options to replace blank values with 0’s or N/A depending on the field type and improve the consistency of the dataset.
The tidyr package
The tidyr package is designed to tidy your data. It works by identifying the variables in your dataset and using the tools provided to move them into columns with three main functions or gather (), separate () and spread().
The gather() function takes multiple columns and gathers them into key value pairs. A an example, say you have exam score data like.
|Name||Exam A||Exam B|
The gather functions work by transforming that into usable columns like this.
Now we are truly able to analyse the exam scores. The separate and spread functions do similar things which you can explore once you have the package but ultimately theyalig your data as needed.
Here are a few other packages of note that may be useful for data cleansing in R
- The purr package
The purr package is designed for data wrangling. It is quite similar to the plyr package, albeit older and some users simply find it easier to use and more standardised in its functionality.
- The sqldf package
A lot of R users are more comfortable coding in SQL language rather than R. This function allows you to write SQL code within R studio to select your data elements
- The janitor package
This package is able to find duplicates by multiple columns and make friendly columns with ease from your dataframe. It even has a get_dupes() function for finding duplicate values amongst multiple rows of data. If you are looking to dedupe your data in a more advanced manner, for example, finding different combinations or using fuzzy logic, you may want to look into a deduping tool instead.
- The splitstackshape package
This is an older package that can work with comma separated values in a dataframe column. Useful for survey or text analysis preparation.
R has a huge number of packages and this article only really touches the surface of what it can do. With new libraries popping up all the time it is important to do your research and get the right ones for you before starting any new project.
Bio: Anna Kayfitz is a CEO of StrategicDB Corp, a data cleansing and analytics company. She holds an MBA from a Schulich School of Business, and has spent over 10 years working in data analytics and marketing roles prior to founding StrategicDB.
- On-line and web-based: Analytics, Data Mining, Data Science, Machine Learning education
- Software for Analytics, Data Science, Data Mining, and Machine Learning
- Don’t do analysis in a vacuum
- Running R and Python in Jupyter
- 2018’s Top 7 R Packages for Data Science and AI