Combining and merging datasets

Introduction Reading Time: 12 min

Table of Contents

Description

Combining and merging datasets is a fundamental data preprocessing step where multiple DataFrames are joined together using common keys or indexes. It’s essential for bringing together scattered data into a single unified dataset for analysis or modeling.
Concatenation joins datasets vertically or horizontally.
Merge/Join combines datasets based on keys/columns.
Append adds one DataFrame to the end of another.

Prerequisites

  • Understanding of Pandas library
  • Knowledge of DataFrame structures
  • Basic concepts of SQL-style joins (INNER, LEFT, RIGHT, OUTER)

Examples

Here's a simple example of a data science task using Python:


import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Age': [24, 30, 28]
})

# --------- Merge (like SQL joins) ----------
merged_df = pd.merge(df1, df2, on='ID', how='inner')  # Only matching IDs

# --------- Outer Merge (to keep all data) ----------
outer_df = pd.merge(df1, df2, on='ID', how='outer')  # Keeps all rows

# --------- Concatenation (row-wise) ----------
df3 = pd.DataFrame({'ID': [4], 'Name': ['David']})
concat_df = pd.concat([df1, df3], ignore_index=True)

# --------- Append (alternative to concat) ----------
appended_df = df1.append(df3, ignore_index=True)

# --------- Concatenation (column-wise) ----------
df4 = pd.DataFrame({'Country': ['USA', 'UK', 'India']})
col_concat = pd.concat([df1, df4], axis=1)

print(merged_df)
print(outer_df)
print(concat_df)
print(col_concat)
          

Real-World Applications

Finance: Merging customer records from multiple branches, Joining transaction data with user profiles

Healthcare: Combining patient info with diagnosis or treatment data, Merging datasets from different hospitals

E-commerce: Joining product catalog with customer orders, Combining website logs with sales data

Where topic Is Applied

Finance

  • Merging user KYC data with financial transactions

E-commerce

  • Joining clickstream data with cart and purchase info

Marketing

  • Merging ad impressions with engagement and conversion records

Resources

Data Science topic PDF

Download

Harvard Data Science Course

Free online course from Harvard covering data science foundations

Visit

Interview Questions

➤ merge() joins datasets based on a key (like SQL), while concat() combines DataFrames by stacking them either row-wise or column-wise.

➤ It performs a full outer join, returning all records from both DataFrames and filling in missing values with NaN.

➤ When combining DataFrames row-wise and you don’t want to keep the original index.

➤ Yes, by passing a list of column names to the on parameter.

➤ join() is a method of the DataFrame object and merges on the index by default. merge() is more flexible and allows merging on arbitrary columns.