Handling duplicates


Description

Handling duplicates is a crucial step in data preprocessing. Duplicate data can arise from data entry errors, merging datasets, or scraping processes. Removing or resolving these duplicates ensures data accuracy and prevents model bias. With pandas you can:

  • Identify duplicate rows
  • Remove them
  • Keep specific occurrences (first or last)
  • Check for duplicates based on certain columns only

Prerequisites

  • Familiarity with Pandas
  • Understanding of DataFrames
  • Basic Python conditionals

Examples

Here's a simple example of identifying and removing duplicate rows with pandas:


import pandas as pd

# Sample DataFrame with duplicates
data = {
    'Name': ['John', 'Anna', 'John', 'Linda', 'Anna'],
    'Age': [25, 22, 25, 33, 22],
    'Gender': ['M', 'F', 'M', 'F', 'F']
}
df = pd.DataFrame(data)

# Check for duplicate rows
duplicates = df.duplicated()
print("Duplicate rows:\n", duplicates)

# Drop all duplicates (keep the first occurrence by default)
df_no_duplicates = df.drop_duplicates()

# Drop duplicates but keep the last occurrence
df_keep_last = df.drop_duplicates(keep='last')

# Drop duplicates based on specific columns
df_name_only = df.drop_duplicates(subset=['Name'])

# Display results
print("Without duplicates:\n", df_no_duplicates)
print("Keeping last occurrence:\n", df_keep_last)
print("Based on Name column:\n", df_name_only)
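Before dropping anything, it is often useful to count how many duplicate rows exist, and sometimes you want to discard every occurrence of a duplicated row rather than keep one copy. A short sketch using the same sample data:

```python
import pandas as pd

# Same sample data as above
data = {
    'Name': ['John', 'Anna', 'John', 'Linda', 'Anna'],
    'Age': [25, 22, 25, 33, 22],
    'Gender': ['M', 'F', 'M', 'F', 'F']
}
df = pd.DataFrame(data)

# Count duplicate rows (rows flagged True by duplicated())
num_duplicates = df.duplicated().sum()
print("Number of duplicate rows:", num_duplicates)  # 2

# keep=False drops every row that has a duplicate anywhere,
# leaving only rows that appear exactly once
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)  # only the 'Linda' row remains
```

`duplicated().sum()` is a quick sanity check to run before and after deduplication.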
          

Real-World Applications

Healthcare

Avoid multiple entries of the same patient record

Finance

Clean duplicate transaction logs before auditing
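For the finance case, deduplication often means keeping only the most recent entry per transaction. A minimal sketch, assuming a hypothetical log with illustrative column names `txn_id`, `amount`, and `timestamp`:

```python
import pandas as pd

# Hypothetical transaction log; column names are illustrative
log = pd.DataFrame({
    'txn_id': ['T1', 'T2', 'T1', 'T3'],
    'amount': [100.0, 250.0, 100.0, 75.5],
    'timestamp': pd.to_datetime([
        '2024-01-01 09:00', '2024-01-01 09:05',
        '2024-01-01 09:10', '2024-01-01 09:20'
    ])
})

# Sort so the most recent entry per txn_id comes last,
# then keep only that last occurrence
clean = (log.sort_values('timestamp')
            .drop_duplicates(subset=['txn_id'], keep='last'))
print(clean)
```

Sorting before `drop_duplicates(keep='last')` is what guarantees "last" means "latest by timestamp" rather than "last by row order".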

Where Handling Duplicates Is Applied

Healthcare

  • Filter duplicate patient records
  • Validate lab test entries

Finance

  • Ensure clean transaction history
  • Remove repeated loan applications

Resources

Handling Duplicates PDF

Harvard Data Science Course: free online course from Harvard covering data science foundations

Interview Questions

How do you identify duplicate rows in a DataFrame?
➤ Use df.duplicated(), which returns a Boolean Series.

How do you remove duplicate rows?
➤ Use df.drop_duplicates().

What does the keep parameter of drop_duplicates() do?
➤ It decides which duplicate to keep: 'first', 'last', or False (drop all).

How do you drop duplicates based on specific columns?
➤ Use df.drop_duplicates(subset=['column_name']).

Why is handling duplicates important?
➤ It ensures data integrity, prevents skewed analysis, and improves model performance.
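The effect of the keep parameter can be checked in a few lines on a tiny frame:

```python
import pandas as pd

s = pd.DataFrame({'x': [1, 1, 2]})

print(s.drop_duplicates(keep='first'))  # keeps rows at index 0 and 2
print(s.drop_duplicates(keep='last'))   # keeps rows at index 1 and 2
print(s.drop_duplicates(keep=False))    # keeps only index 2 (x == 2)
```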