Know Your Data: Part 2

To build an effective learning model, it is essential to understand the quality issues that exist in the data and how to detect and deal with them. In general, data quality issues fall into four major categories.



By Krishna Kumar Tiwari, Data Science Architect at InMobi

In my previous article, we looked at data sets, objects, and attributes in detail. Now let’s look at how to define data quality.

There are many definitions of data quality, but data is generally considered high quality if it is “fit for [its] intended uses in operations, decision making and planning”.

To build an effective learning model, it is essential to understand the quality issues that exist in the data and how to detect and deal with them. In general, data quality issues fall into four major categories.

 

Noise

 
Many say that if there were no noise in data, data mining would be too easy. Noise in data represents the modification of original values. Prof. Jeff M. Phillips of the University of Utah describes the main causes of noise in data as follows.

  • Spurious readings
    These are data points that could be anywhere, and are sometimes ridiculously far from where the real data should have been. With small data sets, these are often pruned by hand. With large sensed datasets, they need to be dealt with automatically (see the sketch after this list). In high dimensions they are often impossible to physically “see.”

  • Measurement error
    This is a small amount of error that can occur on all data points and in every dimension. It can occur because a sensor is not accurate enough, because a rounding or approximation error occurs before the data reaches you, or because of a truncation error in a conversion to an abstract data type. This type of data noise may have little effect on the structure you are trying to measure, but it can violate noiseless assumptions and should be understood.

  • Background data
    These data are from something other than what you are trying to measure. This could be unlabeled data (like unrated movies on Netflix) or people who don’t have a disease you are trying to monitor. The problem is that it sometimes gets mixed in, and is then indistinguishable from the actual data of the phenomenon you are trying to monitor.
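As a concrete illustration of “automatically dealt with,” below is a minimal sketch that prunes spurious readings using a median/MAD rule. The rule and the 3.5 threshold are common illustrative choices, not something prescribed above; a median-based score is used because a single wild reading can inflate the ordinary standard deviation enough to mask itself.

```python
import numpy as np

def prune_spurious(values, thresh=3.5):
    """Drop readings whose modified z-score (median/MAD based) exceeds thresh.

    Median and MAD are used instead of mean/std because a single wild
    reading inflates the standard deviation and can mask itself.
    """
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        return values  # readings (almost) identical; nothing to prune
    mod_z = 0.6745 * np.abs(values - med) / mad
    return values[mod_z <= thresh]

readings = [10.1, 9.8, 10.3, 9.9, 250.0, 10.0]  # 250.0 is a spurious reading
print(prune_spurious(readings))  # [10.1  9.8 10.3  9.9 10. ]
```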

Some data miners have also studied noise in data at a more granular level, distinguishing, for example, class noise (mislabeled objects) from attribute noise (corrupted attribute values).

Noise in data is a well-known problem, and it is frequently a difficult one. Noise filters can be applied to remove noise from the data, but most practitioners focus on building robust algorithms that give acceptable performance even when the data is noisy. The performance of a model or algorithm is measured in terms of precision, bias, and accuracy.
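To make “noise filter” concrete, here is a minimal sketch of one simple filter, a rolling median, applied with pandas. The synthetic signal and the window size of 5 are illustrative assumptions, not the article’s prescription; a median window damps measurement error while roughly preserving the underlying trend.

```python
import numpy as np
import pandas as pd

# Noisy sensor signal: a slow trend plus small measurement error.
rng = np.random.default_rng(0)
t = np.arange(100)
signal = pd.Series(np.sin(t / 10) + rng.normal(scale=0.2, size=t.size))

# Rolling-median filter: each point is replaced by the median of a
# small window around it, damping measurement noise while keeping
# the underlying trend.
smoothed = signal.rolling(window=5, center=True, min_periods=1).median()

print(signal.head(3).round(3).tolist())
print(smoothed.head(3).round(3).tolist())
```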

 

Outliers

 
As the name implies, outliers are data objects that are considerably different from most of the other data objects. For example, a data object whose (X, Y) attributes place it far away from every other data object qualifies as an outlier.

Standard ways of dealing with outliers are the univariate method, the multivariate method, and the Minkowski error. We will cover these methods in detail in coming articles.
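As a preview, here is a minimal sketch of one common form of the univariate method, Tukey’s IQR fences; the fence multiplier k = 1.5 and the toy data are illustrative assumptions, not the specific procedure the coming articles will use.

```python
import numpy as np

def univariate_outliers(x, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR].

    One common form of the univariate method; k = 1.5 is the usual
    default, while k = 3 flags only extreme outliers.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

heights = np.array([160, 165, 158, 172, 168, 310, 163])  # 310 is suspect
print(univariate_outliers(heights))  # only the 310 entry is flagged
```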

 

Missing values

 
It is very common for data objects to be missing one or more attribute values.

Mostly, missing values arise because information was not collected or because an attribute is not applicable to the data object. There are four common ways of dealing with missing values, illustrated in the sketch after the list below.

  • Eliminate Data Objects: The easiest way out, but if your data set has many such objects, your model will not have enough data objects left to learn from.
  • Estimate Missing Values: Missing values can be interpolated from the values of other objects for the same attribute.
  • Ignore the Missing Value During Analysis: Ignoring missing values will introduce some inaccuracy, but that is fine if the number of missing values is not high. Many learning models are robust enough to cope with the inaccuracy caused by missing values.
  • Replace with all possible values (weighted by their probabilities): This is the smart way. You can fill a missing value with the mean or median, with values from look-alike objects, or with all possible values weighted by their probabilities.
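The sketch below walks through all four strategies on a tiny pandas DataFrame. The column names loosely echo the Titanic data set mentioned later, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [22.0, np.nan, 35.0, 29.0],
    "fare": [7.25, 71.28, np.nan, 8.05],
})

# 1. Eliminate data objects: drop any row with a missing attribute.
dropped = df.dropna()

# 2. Estimate missing values: interpolate from neighbouring objects.
interpolated = df.interpolate()

# 3. Ignore during analysis: many pandas reductions skip NaN by default.
mean_age = df["age"].mean()  # computed over the non-missing values only

# 4. Replace with a representative value, e.g. the column median.
imputed = df.fillna(df.median(numeric_only=True))

print(dropped.shape, round(mean_age, 2))
print(imputed)
```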

 

Duplicate data

 
A data set may contain data objects that are duplicates of one another: the same object repeated in the set, sometimes with a different value for an attribute.

First we need to identify the duplicate objects, and second we need to resolve the inconsistent values by merging the duplicates into one object. The challenge is that sometimes two objects look like duplicates but are really different objects, so flagging an object as a duplicate should be a careful, precise call.
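Below is a minimal pandas sketch of both steps. The choice of key columns and the “keep the last value” merge rule are illustrative assumptions; in practice, deciding which conflicting attribute value survives usually needs domain knowledge, which is exactly why duplicate detection is a careful call.

```python
import pandas as pd

# Two records for the same person, with an inconsistent 'city' value.
df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Alice"],
    "email": ["alice@x.com", "bob@x.com", "alice@x.com"],
    "city":  ["Pune", "Delhi", "Mumbai"],
})

# Step 1: identify duplicates by a key that should uniquely identify
# an object (here: name + email).
dupes = df[df.duplicated(subset=["name", "email"], keep=False)]
print(dupes)

# Step 2: resolve the inconsistency by merging duplicates into one
# object; here we simply keep the last-seen value per attribute.
merged = df.groupby(["name", "email"], as_index=False).last()
print(merged)
```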

We are now fairly familiar with data sets and data quality issues. In my next article we will apply all these techniques to the Titanic data set available on Kaggle.

Thanks for reading. Please share your thoughts, feedback, and ideas in the comments. You can also reach me at @simplykk87 on Twitter and LinkedIn.

References:

  • http://sci2s.ugr.es/noisydata
  • http://www.ijorcs.org/uploads/archive/Vol2-Iss2-05-duplicate-detection-of-records-in-queries-using-clustering.pdf
  • https://www.cs.utah.edu/~jeffp/teaching/cs5140/L19-Noise.pdf
  • http://www-users.cs.umn.edu/~kumar/dmbook/index.php
  • https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
  • https://www.kdnuggets.com/2017/01/3-methods-deal-outliers.html

 
Bio: Krishna Kumar Tiwari is a Data Science Architect at InMobi.

Original. Reposted with permission.
