Handling duplicates


Description

Handling duplicates is a crucial step in data preprocessing. Duplicate data can arise from data entry errors, merging datasets, or scraping processes. Removing or resolving these duplicates ensures data accuracy and prevents model bias. With pandas you can:

  • Identify duplicate rows
  • Remove them
  • Keep specific occurrences (first or last)
  • Check for duplicates based on certain columns only

Prerequisites

  • Familiarity with Pandas
  • Understanding of DataFrames
  • Basic Python conditionals

Examples

Here's a simple example of identifying and removing duplicate rows with pandas:


import pandas as pd

# Sample DataFrame with duplicates
data = {
    'Name': ['John', 'Anna', 'John', 'Linda', 'Anna'],
    'Age': [25, 22, 25, 33, 22],
    'Gender': ['M', 'F', 'M', 'F', 'F']
}
df = pd.DataFrame(data)

# Check for duplicate rows
duplicates = df.duplicated()
print("Duplicate rows:\n", duplicates)

# Drop all duplicates (keep the first occurrence by default)
df_no_duplicates = df.drop_duplicates()

# Drop duplicates but keep the last occurrence
df_keep_last = df.drop_duplicates(keep='last')

# Drop duplicates based on specific columns
df_name_only = df.drop_duplicates(subset=['Name'])

# Display results
print("Without duplicates:\n", df_no_duplicates)
print("Keeping last occurrence:\n", df_keep_last)
print("Based on Name column:\n", df_name_only)
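Before dropping anything, it is often useful to count how many duplicate rows exist, and sometimes you want to discard every occurrence of a duplicated row rather than keep one copy. A short sketch using the same sample data:

```python
import pandas as pd

# Same sample data as above
data = {
    'Name': ['John', 'Anna', 'John', 'Linda', 'Anna'],
    'Age': [25, 22, 25, 33, 22],
    'Gender': ['M', 'F', 'M', 'F', 'F']
}
df = pd.DataFrame(data)

# Count duplicate rows (rows flagged True by duplicated())
num_duplicates = df.duplicated().sum()
print("Number of duplicate rows:", num_duplicates)  # 2

# keep=False drops every row that has a duplicate anywhere,
# leaving only rows that appear exactly once
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)  # only the 'Linda' row remains
```

`duplicated().sum()` is a quick sanity check to run before and after deduplication.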
          

Real-World Applications

Healthcare

Avoid multiple entries of the same patient record

Finance

Clean duplicate transaction logs before auditing
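For the finance case, deduplication often means keeping only the most recent entry per transaction. A minimal sketch, assuming a hypothetical log with illustrative column names `txn_id`, `amount`, and `timestamp`:

```python
import pandas as pd

# Hypothetical transaction log; column names are illustrative
log = pd.DataFrame({
    'txn_id': ['T1', 'T2', 'T1', 'T3'],
    'amount': [100.0, 250.0, 100.0, 75.5],
    'timestamp': pd.to_datetime([
        '2024-01-01 09:00', '2024-01-01 09:05',
        '2024-01-01 09:10', '2024-01-01 09:20'
    ])
})

# Sort so the most recent entry per txn_id comes last,
# then keep only that last occurrence
clean = (log.sort_values('timestamp')
            .drop_duplicates(subset=['txn_id'], keep='last'))
print(clean)
```

Sorting before `drop_duplicates(keep='last')` is what guarantees "last" means "latest by timestamp" rather than "last by row order".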

Where Handling Duplicates Is Applied

Healthcare

  • Filter duplicate patient records
  • Validate lab test entries

Finance

  • Ensure clean transaction history
  • Remove repeated loan applications

Resources

Handling Duplicates PDF

Harvard Data Science Course: free online course from Harvard covering data science foundations

Interview Questions

How do you identify duplicate rows in a DataFrame?
➤ Use df.duplicated(), which returns a Boolean Series.

How do you remove duplicate rows?
➤ Use df.drop_duplicates().

What does the keep parameter of drop_duplicates() do?
➤ It decides which duplicate to keep: 'first', 'last', or False (drop all).

How do you drop duplicates based on specific columns?
➤ Use df.drop_duplicates(subset=['column_name']).

Why is handling duplicates important?
➤ It ensures data integrity, prevents skewed analysis, and improves model performance.
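The effect of the keep parameter can be checked in a few lines on a tiny frame:

```python
import pandas as pd

s = pd.DataFrame({'x': [1, 1, 2]})

print(s.drop_duplicates(keep='first'))  # keeps rows at index 0 and 2
print(s.drop_duplicates(keep='last'))   # keeps rows at index 1 and 2
print(s.drop_duplicates(keep=False))    # keeps only index 2 (x == 2)
```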