
Handling Missing Data


Description
Prerequisites
Examples
Real-World Applications
Where Handling Missing Data Is Applied
Resources
Interview Questions

Description

Handling missing data is a crucial step in data preprocessing. Real-world datasets often contain missing values, which pandas represents as NaN (Not a Number) or None. Left untreated, these gaps can skew summary statistics and lead to inaccurate predictions. Dealing with them usually involves three steps:
Detection of missing values
Removal of rows or columns that contain missing values
Imputation using strategies such as the mean, median, or mode, or more advanced techniques
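
Detection deserves a short illustration, because NaN is a floating-point value that never compares equal to itself. The sketch below (the variable names are illustrative only) shows why an equality test misses gaps that isna(), the alias of isnull(), picks up:

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself
print(np.nan == np.nan)        # False

values = pd.Series([1.0, np.nan, 3.0])

# An equality test therefore finds no missing values...
eq_mask = values == np.nan
print(eq_mask.sum())           # 0

# ...while isna() (alias: isnull()) flags the gap
na_mask = values.isna()
print(na_mask.sum())           # 1

# None in a text column is treated as missing as well
labels = pd.Series(['M', None, 'F'])
print(labels.isna().tolist())  # [False, True, False]
```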

Prerequisites

Understanding of Pandas DataFrames
Basic statistics (mean, median, mode)
Familiarity with data types in Python

Examples

The following example builds a small DataFrame with missing values, then detects, drops, and fills them with pandas:


import pandas as pd
import numpy as np

# Creating a sample dataset with missing values
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
    'Age': [25, np.nan, 29, 33, np.nan],
    'Gender': ['M', 'F', 'M', 'F', None]
}
df = pd.DataFrame(data)

# Detect missing values
print(df.isnull())             # Returns a DataFrame of booleans
print(df.isnull().sum())       # Count of missing values per column

# Drop rows with any missing value
df_dropped = df.dropna()

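# Fill missing values with mean (for numerical columns)
# Assign the result back: calling fillna(inplace=True) on a selected column
# is chained assignment, which newer pandas versions warn about
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing values with a specific value (for categorical columns)
df['Gender'] = df['Gender'].fillna('Unknown')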

print(df)
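
The example above fills with the mean and a fixed label. As a rough sketch of the other strategies this page names (median, mode, forward-fill, backward-fill, interpolation), the block below rebuilds the same columns under the assumed name df_alt and adds a hypothetical series of ordered readings, since neighbour-based fills only make sense when row order carries meaning:

```python
import numpy as np
import pandas as pd

# Same columns as the sample dataset, rebuilt so this sketch runs on its own
df_alt = pd.DataFrame({
    'Age': [25, np.nan, 29, 33, np.nan],
    'Gender': ['M', 'F', 'M', 'F', None],
})

# Median is less sensitive to outliers than the mean
df_alt['Age'] = df_alt['Age'].fillna(df_alt['Age'].median())

# Mode (most frequent value) suits categorical columns; mode() can return
# several values on a tie, so take the first one
df_alt['Gender'] = df_alt['Gender'].fillna(df_alt['Gender'].mode()[0])

# Ordered data (hypothetical readings) can borrow from neighbouring rows
readings = pd.Series([10.0, np.nan, np.nan, 16.0])
forward = readings.ffill()         # carry the last known value forward
backward = readings.bfill()        # pull the next known value backward
linear = readings.interpolate()    # linear interpolation between known points

print(df_alt)
print(forward.tolist())
print(backward.tolist())
print(linear.tolist())
```

Which strategy fits depends on the column: median or mode for unordered records, forward-fill or interpolation for time-ordered measurements.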

Real-World Applications

Healthcare

Imputing missing patient vitals before diagnosis
Filling incomplete lab records with median values

Finance

Filling missing credit scores with average values
Handling gaps in transaction history

Where Handling Missing Data Is Applied

Healthcare

Handling missing clinical trial values
Filling absent symptoms in patient records

E-commerce

Dealing with incomplete customer profiles
Product metadata imputation

Resources

Data Science topic PDF


Harvard Data Science Course

Free online course from Harvard covering data science foundations


Interview Questions

➤ NaN stands for “Not a Number” and represents missing or undefined data in numerical arrays or DataFrames.

➤ Use df.isnull() to detect and df.isnull().sum() to count missing values.

➤ You can either drop missing values (df.dropna()) or fill them using imputation (df.fillna()).

➤ Common imputation strategies include mean, median, mode, forward-fill (ffill), backward-fill (bfill), and interpolation.

➤ Dropping is reasonable when only a small number of values are missing, or when they fall in non-critical rows or columns.
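
To make the last point concrete, the sketch below measures how much of each column is missing before deciding what to drop. The name df_check and the 25% cut-off MAX_SHARE are assumptions for illustration, not a rule from pandas or from this page:

```python
import numpy as np
import pandas as pd

df_check = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
    'Age': [25, np.nan, 29, 33, np.nan],
    'Gender': ['M', 'F', 'M', 'F', None],
})

# Share of missing values per column (mean of the boolean mask)
missing_share = df_check.isna().mean()
print(missing_share)

# Hypothetical cut-off: treat up to 25% missing as "small"; tune per project
MAX_SHARE = 0.25
is_small = (missing_share > 0) & (missing_share <= MAX_SHARE)
small_gaps = missing_share[is_small].index.tolist()

# Drop rows only where the lightly affected columns are missing
df_small = df_check.dropna(subset=small_gaps)

# Columns above the cut-off are better candidates for imputation
to_impute = missing_share[missing_share > MAX_SHARE].index.tolist()

print(small_gaps, to_impute, len(df_small))
```

Restricting dropna() to a subset keeps rows that are only missing values in the heavily affected columns, so they remain available for imputation.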
