# 11 Essential Code Blocks for Complete EDA (Exploratory Data Analysis)

This article is a practical guide to exploring any dataset and gaining valuable insights from it.

**By Susan Maina, passionate about data, machine learning enthusiast, and writer at Medium**

Exploratory Data Analysis, or EDA, is one of the first steps of the data science process. It involves learning as much as possible about the data without spending too much time. Here, you get an instinctive as well as a high-level practical understanding of the data. By the end of this process, you should have a general idea of the structure of the data set, some cleaning ideas, the target variable, and possible modeling techniques.

There are some general strategies to quickly perform EDA in most problems. In this article, I will use the Melbourne Housing snapshot dataset from Kaggle to demonstrate the 11 blocks of code you can use to perform a satisfactory exploratory data analysis. The dataset includes `Address`, `Type` of real estate, `Suburb`, `Method` of selling, `Rooms`, `Price`, real estate agent (`SellerG`), `Date` of sale, and `Distance` from the C.B.D. You can follow along by downloading the dataset here.

The first step is importing the libraries required. We will need pandas, NumPy, matplotlib and seaborn. To make sure all our columns are displayed, use `pd.set_option('display.max_columns', 100)`. By default, pandas displays 20 columns and hides the rest.

```
import pandas as pd
pd.set_option('display.max_columns', 100)

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_style('darkgrid')
```

Pandas' `pd.read_csv(path)` reads in the csv file as a DataFrame.

`data = pd.read_csv('melb_data.csv')`

### Basic data set Exploration

**1. Shape (dimensions) of the DataFrame**

The `.shape` attribute of a Pandas DataFrame gives the overall structure of the data. It returns a tuple of length 2 that translates to how many rows of observations and columns the dataset has.

```
data.shape

### Results
(13580, 21)
```

We can see that the dataset has 13,580 observations and 21 features, and one of those features is the target variable.

**2. Data types of the various columns**

The DataFrame's `.dtypes` attribute displays the data types of the columns as a pandas Series (a Series is a column of values with their indices).

```
data.dtypes

### Results
Suburb object
Address object
Rooms int64
Type object
Price float64
Method object
SellerG object
Date object
Distance float64
Postcode float64
Bedroom2 float64
Bathroom float64
Car float64
Landsize float64
BuildingArea float64
YearBuilt float64
CouncilArea object
Lattitude float64
Longtitude float64
Regionname object
Propertycount float64
dtype: object
```

We observe that our dataset has a combination of **categorical** (object) and **numeric** (float and int) features. At this point, I went back to the Kaggle page for an understanding of the columns and their meanings. Check out the table of columns and their definitions here, created with Datawrapper.

What to look out for:

- Numeric features that should be categorical and vice versa.

From a quick analysis, I did not find any mismatch in the data types. This makes sense, as this dataset version is a cleaned snapshot of the original Melbourne data.
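When a mismatch does turn up, the fix is usually a dtype conversion. A minimal sketch on a hypothetical `gender` column (not part of this dataset): a numeric-coded categorical can be converted with `.astype('category')`, and a stringly-typed numeric with `pd.to_numeric`:

```python
import pandas as pd

# Toy stand-in frame: 'gender' is numeric-coded but conceptually
# categorical, while 'price' arrived as strings and should be numeric.
df = pd.DataFrame({'gender': [0, 1, 1, 0],
                   'price': ['100', '250', '175', '300']})

# numeric -> categorical
df['gender'] = df['gender'].astype('category')

# string -> numeric (errors='coerce' turns unparseable values into NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

print(df.dtypes)
```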

**3. Display a few rows**

The Pandas DataFrame has very handy functions for displaying a few observations: `data.head()` displays the first 5 observations, `data.tail()` the last 5, and `data.sample()` an observation chosen randomly from the dataset. You can display 5 random observations using `data.sample(5)`.

```
data.head()
data.tail()
data.sample(5)
```

What to look out for:

- Can you understand the column names? Do they make sense? (Check with the variable definitions again if needed)
- Do the values in these columns make sense?
- Are there significant missing values (NaN) sighted?
- What types of classes do the categorical features have?

My insights: the `Postcode` and `Propertycount` features both change according to the `Suburb` feature. Also, there were significant missing values for `BuildingArea` and `YearBuilt`.
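To put numbers on those gaps rather than eyeballing them, `isnull().sum()` counts the missing values per column. A minimal sketch on a toy stand-in frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in with deliberate gaps, mimicking BuildingArea/YearBuilt
df = pd.DataFrame({
    'BuildingArea': [120.0, np.nan, 95.0, np.nan],
    'YearBuilt': [1970.0, np.nan, 2005.0, 1998.0],
    'Rooms': [3, 2, 4, 3],
})

# Missing values per column, largest first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```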

### Distribution

This refers to how the values in a feature are distributed, or how often they occur. For numeric features, weâ€™ll see how many times groups of numbers appear in a particular column, and for categorical features, the classes for each column and their frequency. We will use bothÂ **graphs**Â and actual summaryÂ **statistics**. The graphs enable us to get an overall idea of the distributions while the statistics give us factual numbers. These two strategies are both recommended as they complement each other.

### Numeric Features

**4. Plot each numeric feature**

We will use the Pandas histogram. A histogram groups numbers into ranges (or bins), and the height of a bar shows how many numbers fall in each range. `df.hist()` plots a histogram for each of the data's numeric features in a grid. We will also provide the `figsize` and `xrot` arguments to increase the grid size and rotate the x-axis labels by 45 degrees.

```
data.hist(figsize=(14,14), xrot=45)
plt.show()
```

Histogram by author

What to look out for:

- Possible outliers that cannot be explained or might be measurement errors.
- Numeric features that should be categorical, for example, `Gender` represented by 1 and 0.
- Boundaries that do not make sense, such as percentage values > 100.

From the histogram, I noted that `BuildingArea` and `LandSize` had potential outliers to the right. Our target feature `Price` was also highly skewed to the right. I also noted that `YearBuilt` was very skewed to the left and its range started at the year 1200, which was odd. Let's move on to the summary statistics for a clearer picture.

**5. Summary statistics of the numerical features**

Now that we have an intuitive feel of the numeric features, we will look at actual statistics using `df.describe()`, which displays their summary statistics.

`data.describe()`

We can see, for each numeric feature, the *count* of values in it, the *mean* value, *std* or standard deviation, the *minimum* value, the *25th* percentile, the *50th* percentile or median, the *75th* percentile, and the *maximum* value. From the count we can also identify the features with **missing values**; their count is not equal to the total number of rows of the dataset. These are `Car`, `BuildingArea` and `YearBuilt`.

I noted that the minimum for both `Landsize` and `BuildingArea` is 0. We also see that `Price` ranges from 85,000 to 9,000,000, which is a big range. We will explore these columns in a detailed analysis later in the project.

Looking at the `YearBuilt` feature, however, we note that the minimum year is 1196. This could be a possible data entry error that will be removed during cleaning.
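One way to earmark such rows before cleaning is a simple boolean filter. A sketch on toy data, using an arbitrary 1800 cut-off for illustration:

```python
import pandas as pd

# Toy rows, one with an implausible build year like the 1196 seen above
df = pd.DataFrame({
    'YearBuilt': [1196.0, 1890.0, 1970.0, 2005.0],
    'Price': [1_000_000, 850_000, 600_000, 1_200_000],
})

# Pull out suspiciously early years for manual inspection
suspect = df[df['YearBuilt'] < 1800]
print(suspect)
```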

### Categorical features

**6. Summary statistics of the categorical features**

For categorical features, it is important to show the summary statistics before we plot graphs, because some features have a lot of unique classes (as we will see for `Address`) and the classes would be unreadable if visualized on a countplot.

To check the summary statistics of only the categorical features, we will use `df.describe(include='object')`.

`data.describe(include='object')`

Categorical summary statistics by author

This table is a bit different from the one for numeric features. Here, we get theÂ *count*Â of the values of each feature, the number ofÂ *unique*Â classes, theÂ *top*Â most frequent class, and howÂ *frequently*Â that class occurs in the data set.

We note that some features have a lot of unique values, such as `Address`, followed by `Suburb` and `SellerG`. From these findings, I will only plot the columns with 10 or fewer unique classes. We also note that `CouncilArea` has missing values.

**7. Plot each categorical feature**

Using the statistics above, we note that `Type`, `Method` and `Regionname` have fewer than 10 classes and can be effectively visualized. We will plot these features using the Seaborn countplot, which is like a histogram for categorical variables. Each bar in a countplot represents a unique class.

I created a for loop. For each categorical feature, a countplot is displayed to show how the classes are distributed for that feature. The line `data.select_dtypes(include='object')` selects the categorical columns with their values. We also include an if statement so as to pick only the three columns with 10 or fewer classes, using `Series.nunique() < 10`. Read the `.nunique()` documentation here.

```
for column in data.select_dtypes(include='object'):
    if data[column].nunique() < 10:
        sns.countplot(y=column, data=data)
        plt.show()
```

Count plots by author

What to look out for:

- Sparse classes which have the potential to affect a model's performance.
- Mistakes in labeling of the classes, for example 2 exact classes with minor spelling differences.

We note that `Regionname` has some sparse classes which might need to be merged or re-assigned during modeling.
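A quick way to surface sparse classes numerically is `value_counts()` with a frequency threshold; the 5% cut-off and region labels below are arbitrary, for illustration only:

```python
import pandas as pd

# Toy Regionname-style column with two sparse classes
region = pd.Series(['Southern Metropolitan'] * 50
                   + ['Northern Metropolitan'] * 40
                   + ['Eastern Victoria'] * 2
                   + ['Western Victoria'] * 1)

counts = region.value_counts()

# Flag classes occurring in under 5% of rows as sparse
sparse = counts[counts < 0.05 * len(region)].index.tolist()
print(sparse)
```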

### Grouping and segmentation

Segmentation allows us to cut the data and observe the relationship between categorical and numeric features.

**8. Segment the target variable by categorical features.**

Here, we will compare the target feature, `Price`, between the various classes of our main categorical features (`Type`, `Method` and `Regionname`) and see how the `Price` changes with the classes.

We use the Seaborn boxplot, which plots the distribution of `Price` across the classes of categorical features. This tutorial, from where I borrowed the image below, explains the boxplot's features clearly. The dots at both ends represent outliers.

Image from www.geeksforgeeks.org

Again, I used a *for loop* to plot a boxplot of each categorical feature with `Price`.

```
for column in data.select_dtypes(include='object'):
    if data[column].nunique() < 10:
        sns.boxplot(y=column, x='Price', data=data)
        plt.show()
```

Box plots by author

What to look out for:

- Which classes most affect the target variable.

Note how the `Price` is still sparsely distributed among the 3 sparse classes of `Regionname` seen earlier, strengthening our case against these classes.

Also note how the `SA` class (the least frequent `Method` class) commands high prices, almost similar to those of the most frequently occurring class `S`.

**9. Group numeric features by each categorical feature.**

Here we will see how all the other numeric features, not just `Price`, change with each categorical feature by summarizing the numeric features across the classes. We use the DataFrame's groupby function to group the data by a category and calculate a metric (such as *mean*, *median*, *min*, *std*, etc.) across the various numeric features.

For only the 3 categorical features with fewer than 10 classes, we group the data, then calculate the `mean` across the numeric features. We use `display()`, which results in a cleaner table than `print()`.

```
for column in data.select_dtypes(include='object'):
    if data[column].nunique() < 10:
        # numeric_only=True skips the object columns (required in
        # recent pandas versions)
        display(data.groupby(column).mean(numeric_only=True))
```

We get to compare the `Type`, `Method` and `Regionname` classes across the numeric features to see how they are distributed.
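If one metric per class is not enough, `groupby` also accepts a list of metrics via `.agg()`; a sketch on toy data (not the article's output):

```python
import pandas as pd

# Toy data; .agg() computes several summary metrics per class in one pass
df = pd.DataFrame({
    'Type': ['h', 'h', 'u', 'u'],
    'Price': [1_200_000, 1_000_000, 500_000, 600_000],
    'Rooms': [4, 3, 2, 1],
})

summary = df.groupby('Type')[['Price', 'Rooms']].agg(['mean', 'median'])
print(summary)
```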

### Relationships between numeric features and other numeric features

**10. Correlations matrix for the different numerical features**

A correlation is a value between -1 and 1 that measures how closely the values of two separate features move together. A *positive* correlation means that as one feature increases the other one also increases, while a *negative* correlation means one feature increases as the other decreases. Correlations close to 0 indicate a *weak* relationship, while values closer to -1 or 1 signify a *strong* relationship.
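The two extremes are easy to verify with pandas' `Series.corr`:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# Moving together perfectly -> correlation of 1
c_pos = x.corr(pd.Series([2, 4, 6, 8, 10]))

# Moving in exact opposition -> correlation of -1
c_neg = x.corr(pd.Series([10, 8, 6, 4, 2]))

print(c_pos, c_neg)
```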

Image from edugyan.in

We will use `df.corr()` to calculate the correlations between the numeric features; it returns a DataFrame. (In recent pandas versions, pass `numeric_only=True` so the object columns are skipped.)

```
corrs = data.corr(numeric_only=True)
corrs
```

This might not mean much now, so let us plot a heatmap to visualize the correlations.

**11. Heatmap of the correlations**

We will use a Seaborn heatmap to plot the grid as a rectangular color-coded matrix. We use `sns.heatmap(corrs, cmap='RdBu_r', annot=True)`.

The `cmap='RdBu_r'` argument tells the heatmap what colour palette to use. A high positive correlation appears as *dark red* and a high negative correlation as *dark blue*; closer to white signifies a weak relationship. Read this nice tutorial for other color palettes. `annot=True` includes the values of the correlations in the boxes for easier reading and interpretation.

```
plt.figure(figsize=(10,8))
sns.heatmap(corrs, cmap='RdBu_r', annot=True)
plt.show()
```

Heatmap by author

What to look out for:

- Strongly correlated features, either dark red (positive) or dark blue (negative).
- The target variable: whether it has strong positive or negative relationships with other features.

We note that `Rooms`, `Bedroom2`, `Bathroom`, and `Price` have strong positive relationships. On the other hand, `Price`, our target feature, has a slightly weak *negative* correlation with `YearBuilt` and an even weaker *negative* relationship with `Distance` from the CBD.
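To read the target's relationships without scanning the whole heatmap, you can sort the `Price` column of the correlation matrix. A sketch on a toy frame (the column names mirror the dataset, the values are made up):

```python
import pandas as pd

# Toy frame where 'Rooms' tracks 'Price' and 'Distance' opposes it
df = pd.DataFrame({
    'Price': [1.0, 2.0, 3.0, 4.0],
    'Rooms': [1, 2, 3, 4],
    'Distance': [8.0, 6.0, 4.0, 2.0],
})

corrs = df.corr(numeric_only=True)

# Rank every other feature by its correlation with the target
price_corr = corrs['Price'].drop('Price').sort_values(ascending=False)
print(price_corr)
```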

In this article, we explored the Melbourne dataset and got a high-level understanding of its structure and features. At this stage, we do not need to be 100% comprehensive because in future stages we will explore the data more elaborately. You can get the full code on Github here. I will be uploading the dataset's cleaning concepts soon.

**Bio: Susan Maina** is passionate about data, a machine learning enthusiast, and a writer at Medium.

Original. Reposted with permission.

**Related:**

- Powerful Exploratory Data Analysis in just two lines of code
- Pandas Profiling: One-Line Magical Code for EDA
- Statistical and Visual Exploratory Data Analysis with One Line of Code