3 Ways to Merge Pandas DataFrames

Combine Pandas data frames using the merge, join, and concatenate operations.



3 Ways to Merge Pandas DataFrames
Image by Editor

 

Data in the real world is scattered and requires bringing different sources together on some common grounds. It also needs to be more efficient and affordable for organizations to store all data in a single table. Thus keeping data in multiple tables and then joining them together when needed is the way to get the best of both worlds, i.e., efficiency and valuable insights.

“Pandas” offers data frame merging, which is quite helpful in data analysis as it allows you to combine data from multiple sources into a single data frame. For example, imagine you have a sales dataset containing information on customer orders and another dataset containing customer demographics. By joining these two dataframes on the customer ID, you can create a new dataframe that includes all the information in one place, making it easier to analyze and understand the relationship between customer demographics and sales.

Combining these dataframes allows you to add additional columns to your data, such as calculated fields or aggregate statistics, that can drive sophisticated machine learning systems. Merging can also be helpful for data preparation tasks such as cleaning, normalizing, and pre-processing.

In this post, you will learn about the three ways to merge Pandas dataframes and the difference between the outputs. You will also be able to appreciate how it facilitates different data analysis use cases using merge, join and concatenate operations.

 

Merge

 

The merge() operation is a method used to combine two dataframes based on one or more common columns, also called keys. The resulting data frame contains only the rows from both dataframes with matching keys. The merge() function is similar to the SQL JOIN operation.

The basic syntax for using merge() is:

merged_df = pd.merge(df1, df2, on='key')

 

Here, df1 and df2 are the two dataframes you want to merge, and the “on” argument defines the column(s) for combining.

By default, pandas will perform an inner join, which means that only the rows with matching keys in both dataframes are included in the resulting dataframe. However, you can specify other types of joins, such as left, right, or outer join, using the how parameter.

Let’s understand this by an example, as shown below. 

Please note you can use a Jupyter Notebook (or an IDE of your choice) to run the below code.

import pandas as pd

# Define two dataframes
df1 = pd.dataframe({"key": ["A", "B", "C", "D"], "value1": [1, 2, 3, 4]})

df2 = pd.dataframe({"key": ["B", "D", "E", "F"], "value2": [5, 6, 7, 8]})

# Perform the merge
merged_df = pd.merge(df1, df2, on="key", how="inner")

# Show the resulting
print(merged_df)

 

This will produce the following output:

  key  value1  value2
0  B      2      5
1  D      4      6

 

As evident in the result, the new data frame merged_df contains only the rows where the values in the 'key' column match, i.e., B and D.

 

Join

 

On the other hand, the join() operation combines two dataframes based on their index, instead of a specific column. The resulting data frame contains only the rows from both dataframes with matching indexes.

The basic syntax for using join() is:

joined_df = df1.join(df2)

 

where df1 and df2 are the two dataframes to be joined.

By default, join() performs a left join, which means that all the rows in the first dataframe (df1) will be included in the resulting dataframe, and any rows in the second dataframe (df2) with matching index values will be added as well. The non-matching rows in the second data frame will have NaN values if there is no match. However, using the how parameter, you can specify other types of joins, such as right, inner or outer.

Let’s understand this by an example, as shown below.

import pandas as pd

# Define two dataframes
df1 = pd.dataframe({"value1": [1, 2, 3, 4]}, index=["A", "B", "C", "D"])

df2 = pd.dataframe({"value2": [5, 6, 7, 8]}, index=["B", "D", "E", "F"])

# Perform the join
joined_df = df1.join(df2, how="inner")

# Show the resulting
print(joined_df)

 

The above code would produce the following output:

   value1  value2
B      2      5
D      4      6

 

Here, the new data frame joined_df contains only the rows where the indices match, i.e., B and D.

 

Concat

 

The concat() is used to concatenate multiple pandas objects (dataframe or Series) along a particular axis (either rows or columns). By default, the axis is 0, meaning that data is concatenated along the rows (vertically). It takes a list of pandas objects as its first argument concatenated in the order specified in the list. 

concatenated_df = pd.concat([df1, df2])

 

The function can be customized through various parameters, such as axis, join, ignore_index, etc.

An example of using the Pandas concat function to combine two dataframes is shown below:

import pandas as pd

df1 = pd.dataframe(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.dataframe(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[0, 1, 2, 3],
)

concatenated_df = pd.concat([df1, df2])
print(concatenated_df)

 

This will produce the following output:

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
0  A4  B4  C4  D4
1  A5  B5  C5  D5
2  A6  B6  C6  D6
3  A7  B7  C7  D7

 

To avoid index duplication as shown above (indexes from 0 to 3 are occurring twice in the concatenated data frame), use ignore_index=True as shown below.

concatenated_df = pd.concat([df1, df2], ignore_index=True)
print(concatenated_df)

 

The result would look like the below.

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

 

Summary

 

In this article, you learned three ways to merge Pandas data frames and how they solve different purposes when dealing with data in any BI project. The post illustrates examples of merge, join and concatenate operations using python code.

 
 
Vidhi Chugh is an AI strategist and a digital transformation leader working at the intersection of product, sciences, and engineering to build scalable machine learning systems. She is an award-winning innovation leader, an author, and an international speaker. She is on a mission to democratize machine learning and break the jargon for everyone to be a part of this transformation.