Outlier Detection

Description

Outliers are data points that differ markedly from the other observations in a dataset. They can skew summary statistics, lead to incorrect conclusions, and degrade model performance. Two common methods for detecting them are:
IQR (Interquartile Range) Method
Z-score (Standard Score) Method
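
As a compact reference, the two rules can be written out as follows. This is a sketch of the usual textbook form; the symbols \mu and \sigma (the mean and standard deviation of the dataset) are notation introduced here, and the constants 1.5 and 3 are conventional defaults rather than fixed requirements.

```latex
\[
\mathrm{IQR} = Q_3 - Q_1, \qquad
x \text{ is flagged if } x < Q_1 - 1.5\,\mathrm{IQR} \ \text{ or } \ x > Q_3 + 1.5\,\mathrm{IQR}
\]
\[
z_i = \frac{x_i - \mu}{\sigma}, \qquad
x_i \text{ is flagged if } |z_i| > 3
\]
```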

Prerequisites

Basic descriptive statistics
Familiarity with the mean, median, quartiles, and standard deviation
Basic Python & NumPy/Pandas knowledge

Examples

The following Python snippets apply both methods to the same small sample:


# IQR Method (Interquartile Range)
import pandas as pd

# Sample data
data = pd.DataFrame({'scores': [10, 12, 14, 15, 16, 18, 19, 30, 40, 100]})

# Calculate IQR
Q1 = data['scores'].quantile(0.25)
Q3 = data['scores'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter outliers (the bounds are -5.25 and 46.75 here, so only 100 is flagged)
outliers = data[(data['scores'] < lower_bound) | (data['scores'] > upper_bound)]
print("Outliers detected using IQR:\n", outliers)

# Z-score Method
import numpy as np
from scipy import stats

# Convert column to numpy array
scores = np.array(data['scores'])

# Calculate Z-scores
z_scores = stats.zscore(scores)

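# Find outliers (|Z-score| > 3)
# Note: on this 10-point sample the largest Z-score is about 2.83 (for 100),
# so nothing is flagged at this threshold and the printed array is empty.
outliers_z = scores[np.abs(z_scores) > 3]
print("Outliers detected using Z-score:\n", outliers_z)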
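
To see what the Z-score step computes, the same values can be standardized by hand with NumPy. This is a minimal sketch, assuming the population standard deviation (ddof=0), which is the default used by scipy.stats.zscore; the variable names are illustrative.

```python
import numpy as np

scores = np.array([10, 12, 14, 15, 16, 18, 19, 30, 40, 100])

# Population standard deviation (ddof=0), matching scipy.stats.zscore's default
mean = scores.mean()
std = scores.std(ddof=0)
z_manual = (scores - mean) / std

# Largest absolute Z-score in the sample, and the values beyond the ±3 threshold
largest = np.abs(z_manual).max()
flagged = scores[np.abs(z_manual) > 3]

print("mean:", mean)
print("largest |z|:", round(largest, 2))
print("flagged at |z| > 3:", flagged)
```

With the population standard deviation, no value among n points can have a Z-score larger in magnitude than sqrt(n - 1), which is exactly 3 for n = 10. A ±3 threshold therefore cannot flag anything in a sample this small, whereas the IQR rule flags 100 on the same data.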

Real-World Applications

Finance

Fraud detection (suspicious transactions)

Healthcare

Unusual medical test results

Cybersecurity

Intrusion detection from access logs

Where Outlier Detection Is Applied

Data preprocessing

Anomaly detection systems

Exploratory data analysis (EDA)
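
In a preprocessing or EDA step, it is often useful to mark suspected outliers instead of dropping them right away. Below is a minimal sketch, assuming the IQR rule with the usual 1.5 multiplier; the column name is_outlier is a hypothetical choice made for illustration.

```python
import pandas as pd

data = pd.DataFrame({"scores": [10, 12, 14, 15, 16, 18, 19, 30, 40, 100]})

q1 = data["scores"].quantile(0.25)
q3 = data["scores"].quantile(0.75)
iqr = q3 - q1

# Hypothetical column name: mark rows outside the IQR bounds instead of deleting them.
# between() is inclusive, so its negation selects values strictly outside the bounds.
data["is_outlier"] = ~data["scores"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

flagged = data[data["is_outlier"]]
kept = data[~data["is_outlier"]]
print(flagged)
print(len(kept), "rows within bounds")
```

Keeping the flag as a column leaves the original rows available for later review, which matters when an unusual value might be a valid observation.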

Resources

Harvard Data Science Course

Free online course from Harvard covering data science foundations

Interview Questions

➤ An outlier is a data point that deviates significantly from other observations.

➤ The IQR method uses the interquartile range (IQR = Q3 - Q1) and flags values that fall below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

➤ A Z-score measures how many standard deviations a data point lies from the mean. Points with Z-scores > 3 or < -3 are commonly treated as outliers.

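➤ IQR is preferred for skewed distributions because quartiles are barely affected by extreme values. The Z-score relies on the mean and standard deviation, which extreme values distort, and it is usually interpreted under an assumption of approximate normality.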

➤ Outliers should not always be removed. Remove them only if they result from errors or are irrelevant to the analysis; some outliers represent rare but valid conditions (e.g., fraud).
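
The point above about the IQR being less affected by extreme values can be checked numerically. The sketch below is illustrative only: it takes the sample used in the Examples section, makes the largest value ten times larger (a hypothetical change), and compares the summary statistics before and after. The summary helper is an assumption introduced here.

```python
import numpy as np

base = np.array([10, 12, 14, 15, 16, 18, 19, 30, 40, 100])
extreme = base.copy()
extreme[-1] = 1000  # hypothetical change: the largest value becomes ten times larger

def summary(values):
    """Return quartile-based and moment-based summaries of a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {"Q1": q1, "Q3": q3, "IQR": q3 - q1,
            "mean": values.mean(), "std": values.std()}

before = summary(base)
after = summary(extreme)
for key in before:
    print(f"{key}: {before[key]:.2f} -> {after[key]:.2f}")
```

The quartiles, and therefore the IQR bounds, stay the same because only the most extreme value moved, while the mean and standard deviation that the Z-score depends on shift substantially.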