Handling duplicates
Description
Handling duplicates is a crucial step in data preprocessing. Duplicate data can arise due to data entry errors, merging datasets, or scraping processes. Removing or resolving these duplicates ensures data accuracy and prevents model bias.
You can:
- Identify duplicate rows
- Remove them
- Keep specific occurrences (first or last)
- Check for duplicates based on certain columns only
Prerequisites
- Familiarity with Pandas
- Understanding of DataFrames
- Basic Python conditionals
Examples
Here's a simple example of detecting and removing duplicates with pandas:
```python
import pandas as pd

# Sample DataFrame with duplicates
data = {
    'Name': ['John', 'Anna', 'John', 'Linda', 'Anna'],
    'Age': [25, 22, 25, 33, 22],
    'Gender': ['M', 'F', 'M', 'F', 'F']
}
df = pd.DataFrame(data)

# Check for duplicate rows (returns a Boolean Series)
duplicates = df.duplicated()
print("Duplicate rows:\n", duplicates)

# Drop duplicate rows (keeps the first occurrence by default)
df_no_duplicates = df.drop_duplicates()

# Drop duplicates but keep the last occurrence
df_keep_last = df.drop_duplicates(keep='last')

# Drop duplicates based on specific columns
df_name_only = df.drop_duplicates(subset=['Name'])

# Display results
print("Without duplicates:\n", df_no_duplicates)
print("Based on Name column:\n", df_name_only)
```
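The example above can be extended to count duplicates and to drop every copy of a duplicated row rather than keeping one. This is a small sketch using the same sample data; the variable names (`num_duplicates`, `df_unique_only`) are illustrative:

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Linda', 'Anna'],
    'Age': [25, 22, 25, 33, 22],
    'Gender': ['M', 'F', 'M', 'F', 'F']
})

# Count how many rows duplicate an earlier row
num_duplicates = df.duplicated().sum()
print("Number of duplicate rows:", num_duplicates)  # 2

# keep=False drops every occurrence of a duplicated row,
# leaving only rows that appear exactly once
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)  # only Linda's row remains
```

Note the difference: `drop_duplicates()` keeps one copy of each duplicated row, while `keep=False` removes all copies.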
Real-World Applications
Healthcare
- Avoid multiple entries of the same patient record
- Validate lab test entries
Finance
- Clean duplicate transaction logs before auditing
- Remove repeated loan applications
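As a sketch of the finance use case, duplicated entries in a transaction log are often resolved by keeping the most recent record for each transaction ID. The column names (`txn_id`, `amount`, `timestamp`) are illustrative assumptions:

```python
import pandas as pd

# Illustrative transaction log where T1 was recorded twice
transactions = pd.DataFrame({
    'txn_id': ['T1', 'T2', 'T1', 'T3'],
    'amount': [100.0, 250.0, 100.0, 75.0],
    'timestamp': pd.to_datetime([
        '2024-01-01 09:00', '2024-01-01 10:30',
        '2024-01-02 08:15', '2024-01-02 11:45'
    ])
})

# Sort by timestamp so the latest record comes last,
# then keep only the last occurrence of each txn_id
clean = (
    transactions
    .sort_values('timestamp')
    .drop_duplicates(subset=['txn_id'], keep='last')
    .sort_values('txn_id')
    .reset_index(drop=True)
)
print(clean)
```

Sorting before deduplicating is what makes `keep='last'` mean "keep the most recent record" here; without the sort, "last" only refers to row order.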
Resources
- Data Science topic PDF
- Harvard Data Science Course: a free online course from Harvard covering data science foundations
Interview Questions

Q: How do you check for duplicate rows in a DataFrame?
➤ Use df.duplicated(), which returns a Boolean Series.

Q: How do you remove duplicate rows?
➤ Use df.drop_duplicates().

Q: What does the keep parameter of drop_duplicates() do?
➤ It decides which duplicate to keep: 'first', 'last', or False (drop all occurrences).

Q: How do you drop duplicates based on specific columns?
➤ Use df.drop_duplicates(subset=['column_name']).

Q: Why is handling duplicates important?
➤ To ensure data integrity, prevent skewed analysis, and improve model performance.