Outlier Detection
Description
Outliers are data points that significantly differ from other observations in a dataset. They can skew analysis, lead to incorrect conclusions, and affect model performance. Two common methods to detect outliers are:
- IQR (Interquartile Range) Method
- Z-score (Standard Score) Method
Prerequisites
- Basic understanding of statistics
- Familiarity with the mean, median, quartiles, and standard deviation
- Basic Python & NumPy/Pandas knowledge
Examples
The examples below apply both methods to a small sample of scores using Python:
# IQR Method (Interquartile Range)
import pandas as pd
# Sample data
data = pd.DataFrame({'scores': [10, 12, 14, 15, 16, 18, 19, 30, 40, 100]})
# Calculate IQR
Q1 = data['scores'].quantile(0.25)
Q3 = data['scores'].quantile(0.75)
IQR = Q3 - Q1
# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter outliers
outliers = data[(data['scores'] < lower_bound) | (data['scores'] > upper_bound)]
print("Outliers detected using IQR:\n", outliers)
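The same steps can be wrapped in a small helper so the fences are computed in one place. This is a minimal sketch, not part of the original example: the function name `iqr_bounds` and the parameter `k` are hypothetical, and it assumes pandas' default linear interpolation for quantiles.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    `iqr_bounds` and `k` are illustrative names, not a library API.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

scores = pd.Series([10, 12, 14, 15, 16, 18, 19, 30, 40, 100])
lower, upper = iqr_bounds(scores)

# Keep only the values that fall strictly outside the fences
flagged = scores[(scores < lower) | (scores > upper)].tolist()
print(lower, upper, flagged)
```

For this sample, Q1 = 14.25 and Q3 = 27.25, so the fences are -5.25 and 46.75 and only the value 100 is flagged; 40 stays inside the upper fence.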
# Z-score Method
import numpy as np
from scipy import stats
# Convert column to numpy array
scores = np.array(data['scores'])
# Calculate Z-scores
z_scores = stats.zscore(scores)
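# Find outliers (|Z-score| > 3)
# Note: with only 10 values, |z| cannot exceed (n - 1) / sqrt(n) ≈ 2.85,
# so this selection is empty for the sample data (the value 100 has z ≈ 2.83)
outliers_z = scores[np.abs(z_scores) > 3]
print("Outliers detected using Z-score:\n", outliers_z)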
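One thing worth checking with the Z-score method on small samples is whether the ±3 threshold can be reached at all. The sketch below recomputes the Z-scores by hand with NumPy, using the population standard deviation (`ddof=0`, which is also the default in `scipy.stats.zscore`), and compares the largest value against the limit (n - 1) / sqrt(n). The variable names are illustrative.

```python
import numpy as np

scores = np.array([10, 12, 14, 15, 16, 18, 19, 30, 40, 100], dtype=float)

# Z-score: distance from the mean in units of the (population) standard deviation
z = (scores - scores.mean()) / scores.std()

n = scores.size
# Largest |z| any single point can reach in a sample of size n (with ddof=0)
max_possible = (n - 1) / np.sqrt(n)

largest = float(np.abs(z).max())
above_threshold = scores[np.abs(z) > 3]

print(round(largest, 2), round(float(max_possible), 2), above_threshold)
```

Here the largest |z| is about 2.83 against a limit of about 2.85, so the ±3 rule flags nothing even though 100 is clearly extreme. On a sample this small, a lower threshold or the IQR method is likely the more useful check.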
Real-World Applications
- Finance: fraud detection (suspicious transactions)
- Healthcare: unusual medical test results
- Cybersecurity: intrusion detection from access logs
Where Outlier Detection Is Applied
- Data preprocessing
- Anomaly detection systems
- Exploratory data analysis (EDA)
Resources
Data Science topic PDF
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
➤ An outlier is a data point that deviates significantly from other observations.
➤ The IQR method uses the interquartile range (IQR = Q3 - Q1) and flags values that fall below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.
➤ A Z-score measures how many standard deviations a data point is from the mean. Points with Z-scores above 3 or below -3 are commonly treated as outliers.
➤ The IQR method is preferred for skewed distributions because quartiles are less affected by extreme values than the mean and standard deviation used by the Z-score (which also assumes normality).
➤ Outliers should not always be removed: remove them only if they result from errors or are not relevant to the analysis. Some outliers represent rare but valid conditions (e.g., fraud).
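Since outliers are not always safe to delete, one option is to flag them and decide later. The sketch below marks rows that fall outside the 1.5×IQR fences in a boolean column instead of dropping them. The column name `is_outlier` is a hypothetical choice, and the data is the same sample used in the examples above.

```python
import pandas as pd

data = pd.DataFrame({'scores': [10, 12, 14, 15, 16, 18, 19, 30, 40, 100]})

q1 = data['scores'].quantile(0.25)
q3 = data['scores'].quantile(0.75)
iqr = q3 - q1

# Mark rows outside the fences; between() is inclusive, so values exactly
# on a fence are not flagged, matching the strict < and > comparison
data['is_outlier'] = ~data['scores'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

flagged = data[data['is_outlier']]   # rows to review by hand
kept = data[~data['is_outlier']]     # rows treated as typical
print(flagged)
```

All ten rows remain in `data`, so a flagged value such as 100 can be inspected (error, or a rare but valid case?) before anything is removed.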