Handling Missing Data

Description

Handling missing data is a crucial step in data preprocessing. Real-world datasets often contain missing values, which pandas represents as NaN (Not a Number) or None. Left untreated, these values can bias an analysis and lead to inaccurate predictions. The main ways to deal with them are:
  • Detecting missing values
  • Removing rows or columns that contain missing values
  • Imputing values with strategies such as the mean, median, or mode, or with more advanced techniques
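As a minimal sketch of how the two missing-value markers behave (the series names and values here are illustrative, not from the original page): NaN never compares equal to itself, so equality checks cannot find missing entries, while isna() flags both NaN and None.

```python
import numpy as np
import pandas as pd

# NaN is a float value that never compares equal to itself,
# so `==` cannot be used to locate missing entries.
nan_equals_itself = (np.nan == np.nan)

# Illustrative series: one numeric, one text, each with a missing entry.
ages = pd.Series([25, None, 29])        # None is stored as NaN in a numeric series
genders = pd.Series(['M', None, 'F'])

# isna() flags both kinds of missing marker.
age_missing = ages.isna().tolist()
gender_missing = genders.isna().tolist()

print(nan_equals_itself)
print(age_missing)
print(gender_missing)
```

This is why detection relies on isna() (or its alias isnull()) rather than on comparisons with np.nan or None.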

Prerequisites

  • Understanding of Pandas DataFrames
  • Basic statistics (mean, median, mode)
  • Familiarity with data types in Python

Examples

The following example builds a small DataFrame with missing values, then detects them, drops the affected rows, and fills the gaps:


import pandas as pd
import numpy as np

# Creating a sample dataset with missing values
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
    'Age': [25, np.nan, 29, 33, np.nan],
    'Gender': ['M', 'F', 'M', 'F', None]
}
df = pd.DataFrame(data)

# Detect missing values
print(df.isnull())             # Returns a DataFrame of booleans
print(df.isnull().sum())       # Count of missing values per column

# Drop rows with any missing value
df_dropped = df.dropna()

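# Fill missing values with the mean (for numerical columns).
# Assign the result back to the column: fillna(inplace=True) on a selected
# column is chained assignment, which recent pandas versions warn about.
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing values with a specific value (for categorical columns)
df['Gender'] = df['Gender'].fillna('Unknown')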

print(df)
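The description also mentions median and mode imputation, which the example above does not show. A minimal sketch, assuming the same sample data rebuilt under the hypothetical name df_alt so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Same sample data as the main example, rebuilt so this sketch is self-contained.
df_alt = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
    'Age': [25, np.nan, 29, 33, np.nan],
    'Gender': ['M', 'F', 'M', 'F', None],
})

# Median imputation: less sensitive to outliers than the mean.
age_median = df_alt['Age'].median()
df_alt['Age'] = df_alt['Age'].fillna(age_median)

# Mode imputation: mode() ignores missing values and returns every tied value,
# so take the first one explicitly.
gender_mode = df_alt['Gender'].mode().iloc[0]
df_alt['Gender'] = df_alt['Gender'].fillna(gender_mode)

print(df_alt)
```

In this sample 'M' and 'F' occur equally often, so the mode is a tie. With ties, choosing the first value is an arbitrary rule, and a placeholder such as 'Unknown' may be the more honest choice.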
          

Real-World Applications

Healthcare

  • Imputing missing patient vitals before diagnosis
  • Filling incomplete lab records with median values

Finance

  • Filling missing credit scores with average values
  • Handling gaps in transaction history

Where Handling Missing Data Is Applied

Healthcare

  • Handling missing clinical trial values
  • Filling absent symptoms in patient records

E-commerce

  • Dealing with incomplete customer profiles
  • Product metadata imputation

Resources

Harvard Data Science Course

Free online course from Harvard covering data science foundations

Interview Questions

➤ NaN stands for “Not a Number” and represents missing or undefined data in numerical arrays or DataFrames.

➤ Use df.isnull() to detect and df.isnull().sum() to count missing values.

➤ Missing values can either be dropped (df.dropna()) or filled by imputation (df.fillna()).

➤ Common imputation strategies are the mean, median, and mode, forward-fill (ffill), backward-fill (bfill), and interpolation.
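A minimal sketch of the order-based fills named above, using a hypothetical series of readings (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps.
readings = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan])

forward = readings.ffill()        # carry the last observed value forward
backward = readings.bfill()       # pull the next observed value backward
linear = readings.interpolate()   # linear interpolation between observed neighbours

print(pd.DataFrame({
    'raw': readings,
    'ffill': forward,
    'bfill': backward,
    'interpolate': linear,
}))
```

These methods depend on row order, so they suit time series and other ordered data. Note that bfill leaves the final gap unfilled here because no later value exists to copy from.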

➤ Dropping is reasonable when only a small number of values are missing or when they fall in non-critical rows or columns.
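One way to make that judgment concrete is to measure how much of each column is missing before dropping anything. The sketch below is an illustration only: the table, the column names, and the 0.5 cut-off (MAX_MISSING) are assumptions, not values from the original page.

```python
import numpy as np
import pandas as pd

# Hypothetical table: 'notes' is mostly empty, 'age' has a single gap.
records = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'age': [25, np.nan, 29, 33, 41],
    'notes': [None, None, 'follow-up', None, None],
})

# Fraction of missing values per column.
missing_fraction = records.isna().mean()

# Assumed cut-off for this sketch: drop columns that are more than half empty.
MAX_MISSING = 0.5
sparse_columns = missing_fraction[missing_fraction > MAX_MISSING].index
trimmed = records.drop(columns=sparse_columns)

# Only one remaining row has a gap, so dropping it loses little data.
cleaned = trimmed.dropna()

print(missing_fraction)
print(cleaned)
```

Dropping the sparse column first keeps more rows: calling dropna() on the full table would discard every row with an empty 'notes' field.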