Outlier Detection

Description

Outliers are data points that differ significantly from the other observations in a dataset. They can skew analysis, lead to incorrect conclusions, and degrade model performance. Two common methods for detecting outliers are:

  • IQR (Interquartile Range) Method
  • Z-score (Standard Score) Method

Prerequisites

  • Basic understanding of descriptive statistics
  • Familiarity with the mean, median, quartiles, and standard deviation
  • Basic Python & NumPy/Pandas knowledge

Examples

The following Python example applies both methods to a small set of scores:


# IQR Method (Interquartile Range)
import pandas as pd

# Sample data
data = pd.DataFrame({'scores': [10, 12, 14, 15, 16, 18, 19, 30, 40, 100]})

# Calculate IQR
Q1 = data['scores'].quantile(0.25)
Q3 = data['scores'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter outliers
outliers = data[(data['scores'] < lower_bound) | (data['scores'] > upper_bound)]
print("Outliers detected using IQR:\n", outliers)

# Z-score Method
import numpy as np
from scipy import stats

# Convert column to numpy array
scores = np.array(data['scores'])

# Calculate Z-scores
z_scores = stats.zscore(scores)

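# Find outliers (|Z-score| > 3)
# Note: with only 10 points, |z| cannot exceed (n - 1) / sqrt(n) ≈ 2.85,
# so this prints an empty array; the value 100 has a Z-score of about 2.83.
outliers_z = scores[np.abs(z_scores) > 3]
print("Outliers detected using Z-score:\n", outliers_z)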


          

Real-World Applications

Finance

Fraud detection (suspicious transactions)

Healthcare

Unusual medical test results

Cybersecurity

Intrusion detection from access logs

Where Outlier Detection Is Applied

  • Data preprocessing
  • Anomaly detection systems
  • Exploratory data analysis (EDA)

Resources

Harvard Data Science Course

Free online course from Harvard covering data science foundations

Interview Questions

➤ An outlier is a data point that deviates significantly from other observations.

➤ The IQR method uses the interquartile range (IQR = Q3 - Q1) and flags values that fall below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.
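
The IQR rule can be wrapped in a small helper. This is a minimal sketch rather than a definitive implementation: the function name `iqr_bounds` and its `k` parameter are hypothetical, and the scores are the sample data from the Examples section. It assumes NumPy's default linear interpolation for percentiles, which should match the default of pandas' `quantile`.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

scores = [10, 12, 14, 15, 16, 18, 19, 30, 40, 100]
lower, upper = iqr_bounds(scores)

# Keep only the values that fall outside the bounds.
iqr_outliers = [x for x in scores if x < lower or x > upper]
print("bounds:", lower, upper)
print("outliers:", iqr_outliers)
```

Passing a larger `k` widens the bounds, so only more extreme values are flagged.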

➤ A Z-score measures how many standard deviations a data point is from the mean. Points with Z-scores above 3 or below -3 are commonly treated as outliers.
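
The Z-score can also be computed by hand, which makes one limitation of the ±3 cutoff visible. The sketch below assumes the population standard deviation (`ddof=0`, the default of `scipy.stats.zscore`) and reuses the sample scores from the Examples section; the variable names are illustrative only.

```python
import numpy as np

scores = np.array([10, 12, 14, 15, 16, 18, 19, 30, 40, 100], dtype=float)

# Z-score: distance from the mean in units of standard deviation.
# ddof=0 (population standard deviation) matches scipy.stats.zscore's default.
z_scores = (scores - scores.mean()) / scores.std(ddof=0)

# With n points and ddof=0, no |z| can exceed (n - 1) / sqrt(n),
# so a cutoff of 3 cannot be reached until the sample has at least 11 points.
n = len(scores)
max_possible_z = (n - 1) / np.sqrt(n)

print("largest |z| in the sample:", np.abs(z_scores).max())
print("upper limit for n = 10:", max_possible_z)
```

On small samples like this one, a lower cutoff or the IQR method may be a more practical choice.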

➤ IQR is preferred for skewed distributions since it is not affected by extreme values as much as the Z-score (which assumes normality).
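
A quick way to see this robustness is to make one value far more extreme and compare the statistics each method depends on. The sketch below is illustrative only: the `extreme` array is a hypothetical variation of the sample scores, and `summarize` is a made-up helper name.

```python
import numpy as np

base = np.array([10, 12, 14, 15, 16, 18, 19, 30, 40, 100], dtype=float)
extreme = base.copy()
extreme[-1] = 1000  # hypothetical: make the largest value ten times larger

def summarize(values):
    """Return the statistics used by the Z-score and IQR methods."""
    q1, q3 = np.percentile(values, [25, 75])
    return {"mean": values.mean(), "std": values.std(), "q1": q1, "q3": q3}

base_stats = summarize(base)
extreme_stats = summarize(extreme)

for name, s in [("base", base_stats), ("extreme", extreme_stats)]:
    print(f"{name:8s} mean={s['mean']:.2f} std={s['std']:.2f} "
          f"Q1={s['q1']:.2f} Q3={s['q3']:.2f}")
```

The mean and standard deviation shift with the extreme value, while Q1 and Q3 stay the same, so the IQR bounds are unchanged.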

➤ Outliers should not always be removed. Remove them only if they result from errors or are irrelevant to the analysis; some outliers represent rare but valid events (e.g., fraud).
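
One way to act on this is to flag suspected outliers for review instead of dropping them immediately. The following is a sketch under assumptions: the `is_outlier` column name is hypothetical, and the IQR rule with the 1.5 multiplier is used only because it appears earlier on this page.

```python
import pandas as pd

data = pd.DataFrame({"scores": [10, 12, 14, 15, 16, 18, 19, 30, 40, 100]})

q1 = data["scores"].quantile(0.25)
q3 = data["scores"].quantile(0.75)
iqr = q3 - q1

# Mark suspicious rows instead of deleting them, so they can be reviewed first.
# between() is inclusive, so values exactly on a bound are not flagged.
data["is_outlier"] = ~data["scores"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Only after a flagged row is confirmed to be an error, filter it out explicitly.
cleaned = data[~data["is_outlier"]]

print(data[data["is_outlier"]])
print(len(cleaned), "rows kept")
```

Keeping the flag as a column preserves the original values, so a valid but rare observation can still be analyzed later.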