Handling Missing Data
Description
Handling missing data is a crucial step in data preprocessing. Real-world datasets often contain missing values, which Pandas represents as NaN (Not a Number) or None. Left untreated, these values can skew analysis and lead to inaccurate predictions. The main ways to deal with them are:
- Detection of missing values
- Removal of rows/columns with missing values
- Imputation using strategies such as mean, median, mode, or advanced techniques
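As a rough illustration of the three methods listed above, the sketch below measures missingness per column, drops a mostly empty column, and imputes the rest with the median and the mode. The column names, the sample values, and the 50% cutoff are assumptions made for this example, not recommendations from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'score' is numeric, 'city' is categorical,
# and 'notes' is mostly empty.
df = pd.DataFrame({
    "score": [10.0, np.nan, 30.0, 100.0],
    "city": ["Oslo", "Oslo", None, "Rome"],
    "notes": [None, None, None, "ok"],
})

# Detection: fraction of missing values in each column.
missing_ratio = df.isna().mean()

# Removal: keep only columns where at most half of the values are missing
# (the 0.5 cutoff is an arbitrary choice for this sketch).
df = df.loc[:, missing_ratio <= 0.5].copy()

# Imputation: median for the numeric column, mode for the categorical one.
df["score"] = df["score"].fillna(df["score"].median())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(df)
```

The median is used here instead of the mean because the single large value (100.0) would pull the mean upward.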
Prerequisites
- Understanding of Pandas DataFrames
- Basic statistics (mean, median, mode)
- Familiarity with data types in Python
Examples
The following example uses Pandas to detect, drop, and fill missing values in a small dataset:
import pandas as pd
import numpy as np

# Create a sample dataset with missing values
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
    'Age': [25, np.nan, 29, 33, np.nan],
    'Gender': ['M', 'F', 'M', 'F', None]
}
df = pd.DataFrame(data)

# Detect missing values
print(df.isnull())        # Returns a DataFrame of booleans
print(df.isnull().sum())  # Count of missing values per column

# Drop rows with any missing value (returns a new DataFrame; df is unchanged)
df_dropped = df.dropna()
print(df_dropped)

# Fill missing values with the mean (for numerical columns).
# Assign the result back: calling fillna(..., inplace=True) on a selected
# column is deprecated and may not modify df in recent Pandas versions.
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing values with a specific value (for categorical columns)
df['Gender'] = df['Gender'].fillna('Unknown')
print(df)
Real-World Applications
Healthcare
- Imputing missing patient vitals before diagnosis
- Filling incomplete lab records with median values
Finance
- Filling missing credit scores with average values
- Handling gaps in transaction history
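Gaps in a transaction history are often implicit: days with no activity have no row at all. The sketch below, using made-up account data and hypothetical column names (`amount`, `balance`), first makes those gaps explicit and then fills them. It assumes that a missing day means no transaction, which will not hold for every dataset.

```python
import pandas as pd

# Hypothetical account data: only days with activity are recorded.
tx = pd.DataFrame(
    {"amount": [120.0, -40.0, 75.0], "balance": [1120.0, 1080.0, 1155.0]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
)

# Reindex to a continuous daily calendar; missing days appear as NaN rows.
daily = tx.asfreq("D")

# Assumption: no recorded transaction means an amount of zero,
# and the balance carries over from the previous day.
daily["amount"] = daily["amount"].fillna(0.0)
daily["balance"] = daily["balance"].ffill()

print(daily)
```

Different columns get different fill rules here: a constant for the flow (`amount`) and a forward fill for the running state (`balance`).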
Where Handling Missing Data Is Applied
Healthcare
- Handling missing clinical trial values
- Filling absent symptoms in patient records
E-commerce
- Dealing with incomplete customer profiles
- Product metadata imputation
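For product metadata, one common approach is to impute within groups instead of across the whole table, so a missing laptop weight is filled from other laptops, not from phones. The catalog, the `category` and `weight_kg` columns, and the choice of the median are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical product catalog with gaps in the 'weight_kg' column.
products = pd.DataFrame({
    "category": ["laptop", "laptop", "laptop", "phone", "phone"],
    "weight_kg": [1.4, np.nan, 1.8, 0.2, np.nan],
})

# Median weight within each category, aligned back to the original rows.
category_median = products.groupby("category")["weight_kg"].transform("median")

# Fill only the missing entries; observed values are left untouched.
products["weight_kg"] = products["weight_kg"].fillna(category_median)

print(products)
```

If an entire category has no observed weights, its median is also NaN and those rows stay missing, so a fallback (for example the overall median) may still be needed.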
Interview Questions
➤ NaN stands for “Not a Number” and represents missing or undefined data in numerical arrays or DataFrames.
➤ Use df.isnull() to detect and df.isnull().sum() to count missing values.
➤ To handle missing values, you can either drop them (df.dropna()) or fill them using imputation (df.fillna()).
➤ Common imputation strategies include mean, median, mode, forward-fill (ffill), backward-fill (bfill), and interpolation.
➤ Dropping missing data is reasonable when only a small number of values are missing or when they fall in non-critical rows/columns.
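To make the order-based fill strategies mentioned above concrete, the sketch below applies forward-fill, backward-fill, and linear interpolation to the same series. The readings are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with two consecutive gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

forward = s.ffill()       # repeat the last observed value
backward = s.bfill()      # use the next observed value
linear = s.interpolate()  # linear interpolation between observed neighbors

print(pd.DataFrame({"ffill": forward, "bfill": backward, "interpolate": linear}))
```

All three depend on row order, so they suit time series or other sorted data; for unordered records, mean, median, or mode imputation is usually the more natural choice.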