Know Your Data: Part 1

This article introduces the different types of data sets, data objects, and attributes.



By Krishna Kumar Tiwari, Data Science Architect at InMobi

In my previous article, we discussed the chemistry between Big Data and Machine Learning. We also concluded that to build a good learning model, we first need to understand our data well.

The picture below represents the machine learning and data mining process in general. Data cleaning and feature extraction are the most tedious steps, but you need to be good at them to make your model more accurate.

 

What is Data or Data Set?

 
A data set is a collection of data objects and their attributes. Attributes capture the basic characteristics of an object. Let's look at the famous Titanic data set, which is available here. It contains a list of passengers with their information and survival status (survived=0/1).

Each row represents a passenger object, and the columns are the characteristics of that passenger, also called attributes. Attribute values are numbers or categories. For example, Age is a number, but Pclass represents the passenger class (1 means 1st class, 2 means 2nd class …).
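To make the object/attribute split concrete, here is a minimal sketch in plain Python. The rows are made up for illustration and are not copied from the real Titanic file; the helper name `column` is also just an assumption of this sketch.

```python
# Hypothetical passenger rows; values are invented for illustration only.
passengers = [
    {"PassengerId": 1, "Pclass": 3, "Age": 30.0, "Survived": 0},
    {"PassengerId": 2, "Pclass": 1, "Age": 45.0, "Survived": 1},
    {"PassengerId": 3, "Pclass": 2, "Age": 19.0, "Survived": 1},
]

# Each dict is one data object; its keys are the attributes.
attributes = sorted(passengers[0].keys())


def column(rows, name):
    """Collect the values of one attribute across all data objects."""
    return [row[name] for row in rows]


ages = column(passengers, "Age")        # numeric attribute
classes = column(passengers, "Pclass")  # category stored as a number
```

Reading a row gives you one object; reading a column gives you one attribute across all objects.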

 

Types of Attributes?

 
To understand the data, you first need to understand the different types of attributes. The type of an attribute depends on which of the following properties it possesses:

Distinctness: =, !=
Order: < >
Addition: + -
Multiplication: * /


 
Nominal

Has only the distinctness property: values can be compared for equality but not ordered. Examples: ID numbers, eye color, zip codes.

 
Ordinal

Has distinctness and order: values can be ranked, but the gaps between them are not meaningful. Examples: rankings (e.g., taste of potato chips on a scale from 1–10), grades, height in {tall, medium, short}.

 
Interval

Has distinctness, order, and addition: differences between values are meaningful, but ratios are not. Examples: calendar dates, temperatures in Celsius or Fahrenheit.

 
Ratio

Has all four properties, so both differences and ratios are meaningful. Examples: temperature in Kelvin, length, time, counts.
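The four types form a ladder: each one keeps the properties of the previous type and adds one more. A small sketch of that idea follows; the names `AttributeType`, `supported_properties`, and `is_meaningful` are illustrative choices, not part of any library.

```python
from enum import Enum


class AttributeType(Enum):
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Properties in the order listed above; each type adds the next one.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]


def supported_properties(attr_type):
    """Return the properties an attribute of this type possesses."""
    return PROPERTIES[: attr_type.value]


def is_meaningful(attr_type, prop):
    """Check whether an operation family makes sense for this type."""
    return prop in supported_properties(attr_type)
```

For example, `is_meaningful(AttributeType.INTERVAL, "multiplication")` is false: 20 °C is not "twice as hot" as 10 °C, although the 10-degree difference between them is meaningful.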

Based on the values they take, attributes can also be divided into two types.

Discrete attributes: These have a finite or countably infinite set of values. Zip code and Pclass are good examples. Binary attributes are a special form of discrete attribute (the survived attribute in the Titanic data set is an example of a binary attribute).

Continuous attributes: These take real numbers as values. Example: fare in the Titanic data set.
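As a rough illustration, the split can be guessed from a column's values. This is only a heuristic sketch with a hypothetical function name, `value_kind`; in practice the right answer comes from knowing what the attribute means, not from inspecting a sample.

```python
def value_kind(values):
    """Guess whether a column looks continuous, binary, or discrete.

    Heuristic only: a fractional float suggests a continuous attribute,
    exactly two distinct values suggest a binary one, anything else is
    treated as discrete.
    """
    distinct = set(values)
    if any(isinstance(v, float) and not v.is_integer() for v in distinct):
        return "continuous"
    if len(distinct) == 2:
        return "binary"
    return "discrete"
```

With made-up samples, `value_kind([0, 1, 1, 0])` returns `"binary"` (like survived), `value_kind([1, 2, 3, 3])` returns `"discrete"` (like Pclass), and `value_kind([7.25, 71.28, 8.05])` returns `"continuous"` (like fare).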

 

Types of Data Sets?

 
Record Data Sets: These consist of a collection of records, each with a fixed set of attributes. This is also known as structured data. Below are a few examples of record data sets.

Figure: Data Matrix

Figure: Document Data

Figure: Transaction Data
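One common way to represent these record types, sketched here with made-up data: document data becomes a matrix of term counts (one row per document, one column per word), and transaction data becomes a set of items per record. The function names `term_vector` and `support_count` are assumptions of this sketch.

```python
from collections import Counter

# Document data: each document becomes a vector of term counts.
documents = [
    "data mining finds patterns in data",
    "machine learning learns from data",
]
vocabulary = sorted({word for doc in documents for word in doc.split()})


def term_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


# Rows are documents, columns are terms: a data matrix.
matrix = [term_vector(doc, vocabulary) for doc in documents]

# Transaction data: each record is a set of items (hypothetical baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
]


def support_count(transactions, item):
    """Number of transactions that contain the item."""
    return sum(item in basket for basket in transactions)
```

Once documents are turned into count vectors, they have a fixed set of attributes like any other record, which is why document data is grouped with record data sets.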

 

 
Graph Data Sets:

Figure: World Wide Web

Figure: Molecular Structures
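Graph data stores relationships between objects rather than only the objects themselves. A minimal sketch for the web example, using an adjacency list with hypothetical page names:

```python
# Each key is a page; its list holds the pages it links to (made-up names).
links = {
    "home.html": ["about.html", "blog.html"],
    "about.html": ["home.html"],
    "blog.html": ["home.html", "about.html"],
}


def out_degree(graph, node):
    """Number of links leaving the page."""
    return len(graph[node])


def in_degree(graph, node):
    """Number of pages that link to the given page."""
    return sum(node in targets for targets in graph.values())
```

The same structure works for molecules if the nodes are atoms and the edges are bonds.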

 

 
Ordered Data Sets:

Figure: Sequential Data

Figure: Temporal Data
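In ordered data, the position of a record carries information, so the records must be put in order before analysis. A small sketch with made-up daily readings:

```python
from datetime import date

# Hypothetical (date, value) readings, deliberately out of order.
readings = [
    (date(2020, 1, 3), 12.0),
    (date(2020, 1, 1), 10.0),
    (date(2020, 1, 2), 11.5),
]

# Sort by timestamp so that neighbouring records are neighbours in time.
ordered = sorted(readings, key=lambda reading: reading[0])

# Day-over-day change only makes sense once the order is fixed.
changes = [
    later[1] - earlier[1]
    for earlier, later in zip(ordered, ordered[1:])
]
```

Shuffling the rows of a record data set loses nothing, but shuffling these readings would change the computed changes, which is what separates ordered data from plain records.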

 

In this article, we covered the different types of data sets, data objects, and attributes. In my next article, we will look at the issues that affect data sets and how to identify and deal with them. We will also walk through an example of feature extraction on the Titanic data set.

Thanks for reading. Please share your thoughts, feedback, and ideas in the comments. You can also reach me at @simplykk87 on Twitter and LinkedIn.

 
References

Introduction to Data Mining, Pang-Ning Tan (Michigan State University), Michael Steinbach (University of Minnesota), and Vipin Kumar (University of Minnesota)

 
Bio: Krishna Kumar Tiwari is a Data Science Architect at InMobi.

Original. Reposted with permission.
