Combining and merging datasets
Table of Contents
Description
Combining and merging datasets is a fundamental data preprocessing step where multiple DataFrames are joined together using common keys or indexes. It’s essential for bringing together scattered data into a single unified dataset for analysis or modeling.
Concatenation joins datasets vertically or horizontally.
Merge/Join combines datasets based on keys/columns.
Append adds one DataFrame to the end of another.
Prerequisites
- Understanding of Pandas library
- Knowledge of DataFrame structures
- Basic concepts of SQL-style joins (INNER, LEFT, RIGHT, OUTER)
Examples
Here's a simple example of a data science task using Python:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'ID': [1, 2, 4],
'Age': [24, 30, 28]
})
# --------- Merge (like SQL joins) ----------
merged_df = pd.merge(df1, df2, on='ID', how='inner') # Only matching IDs
# --------- Outer Merge (to keep all data) ----------
outer_df = pd.merge(df1, df2, on='ID', how='outer') # Keeps all rows
# --------- Concatenation (row-wise) ----------
df3 = pd.DataFrame({'ID': [4], 'Name': ['David']})
concat_df = pd.concat([df1, df3], ignore_index=True)
# --------- Append (alternative to concat) ----------
appended_df = df1.append(df3, ignore_index=True)
# --------- Concatenation (column-wise) ----------
df4 = pd.DataFrame({'Country': ['USA', 'UK', 'India']})
col_concat = pd.concat([df1, df4], axis=1)
print(merged_df)
print(outer_df)
print(concat_df)
print(col_concat)
Real-World Applications
Finance: Merging customer records from multiple branches, Joining transaction data with user profiles
Healthcare: Combining patient info with diagnosis or treatment data, Merging datasets from different hospitals
E-commerce: Joining product catalog with customer orders, Combining website logs with sales data
Where topic Is Applied
Finance
- Merging user KYC data with financial transactions
E-commerce
- Joining clickstream data with cart and purchase info
Marketing
- Merging ad impressions with engagement and conversion records
Resources
Data Science topic PDF
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
➤ merge() joins datasets based on a key (like SQL), while concat() combines DataFrames by stacking them either row-wise or column-wise.
➤ It performs a full outer join, returning all records from both DataFrames and filling in missing values with NaN.
➤ When combining DataFrames row-wise and you don’t want to keep the original index.
➤ Yes, by passing a list of column names to the on parameter.
➤ join() is a method of the DataFrame object and merges on the index by default. merge() is more flexible and allows merging on arbitrary columns.