bias-variance
Description
The bias-variance trade-off is a fundamental concept in machine learning. It describes the tension between two sources of prediction error that together determine how well a model generalizes to unseen data.
Bias refers to the error due to overly simplistic assumptions in the model. High bias leads to underfitting.
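Variance refers to the error due to sensitivity to the particular training set, which grows with model complexity. High variance leads to overfitting.
Reducing one usually increases the other, so the goal is to find the level of complexity that keeps both bias and variance low enough to minimize total error.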
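For squared-error loss, this balance can be written down explicitly. The notation here is an assumption introduced for illustration and does not appear elsewhere in this document: data are generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model fitted on a random training set, and expectations are taken over training sets and noise. Under those assumptions, the expected error at a point $x$ decomposes as:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The noise term cannot be reduced by any model, so model selection only trades the first two terms against each other.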
Prerequisites
- Understanding of supervised learning
- Knowledge of training/testing splits
- Model evaluation metrics (like MSE, accuracy)
Examples
The following Python example fits a typically high-bias model and a typically high-variance model to the same synthetic data and compares their test errors:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)
# Split data (fixed random_state so the comparison is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Low variance, high bias model (Linear Regression)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
# High variance, low bias model (unpruned Decision Tree)
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
# Compare errors
print("Linear Regression MSE (High Bias):", mean_squared_error(y_test, y_pred_lr))
print("Decision Tree MSE (High Variance):", mean_squared_error(y_test, y_pred_dt))
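🔍 This code compares a typically high-bias model (linear regression) and a typically high-variance model (an unpruned decision tree) using Mean Squared Error (MSE) on the test set. Note that make_regression generates a linear target, so on this particular dataset linear regression is correctly specified and its bias is small; the labels describe how each model tends to behave in general. The tree will usually show the larger test MSE here because it fits the noise in the training data.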
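A single train/test split cannot separate bias from variance. The sketch below is one way to estimate both terms empirically: it refits each model on many freshly sampled training sets and looks at how the predictions behave on a fixed grid. The nonlinear target `true_fn`, the noise level, and the helper names are assumptions chosen for illustration, not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical nonlinear ground truth, chosen only for illustration
    return np.sin(2 * x)

def sample_training_set(n=80, noise_sd=0.3):
    # Draw a fresh noisy training set from the same underlying function
    X = rng.uniform(-3, 3, size=(n, 1))
    y = true_fn(X[:, 0]) + rng.normal(0, noise_sd, size=n)
    return X, y

def estimate_bias_variance(make_model, n_rounds=200):
    # Fixed evaluation grid shared by every refit
    X_eval = np.linspace(-3, 3, 50).reshape(-1, 1)
    preds = np.empty((n_rounds, len(X_eval)))
    for i in range(n_rounds):
        X, y = sample_training_set()
        preds[i] = make_model().fit(X, y).predict(X_eval)
    # Squared bias: gap between the average prediction and the true function
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(X_eval[:, 0])) ** 2)
    # Variance: spread of the predictions across training sets
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

lr_bias_sq, lr_var = estimate_bias_variance(LinearRegression)
dt_bias_sq, dt_var = estimate_bias_variance(
    lambda: DecisionTreeRegressor(random_state=0)
)

print(f"Linear Regression: bias^2={lr_bias_sq:.3f}, variance={lr_var:.3f}")
print(f"Decision Tree:     bias^2={dt_bias_sq:.3f}, variance={dt_var:.3f}")
```

On a nonlinear target like this one, the straight line should show the larger squared bias and the unpruned tree the larger variance, which matches the usual characterization of the two models.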
Real-World Applications
Finance
Fraud detection models that must neither miss real fraud patterns (underfitting) nor flag noise in past transactions as fraud (overfitting)
Healthcare
Choosing disease prediction models that are complex enough to avoid underfitting but not so complex that they overfit
Marketing
Campaign response prediction, where the model has to generalize from past campaigns to new ones
Where the Bias-Variance Trade-off Is Applied
- Model selection and evaluation
- Error analysis in supervised learning
- Tuning model complexity (regularization, pruning, etc.)
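Tuning model complexity, the last item above, can be sketched as a sweep over a single complexity parameter. The example below varies `max_depth` of a `DecisionTreeRegressor` and reports training and test MSE for each setting. The synthetic sine-shaped data and the list of depths are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# None means the tree grows until every leaf is pure (maximum complexity)
depths = [1, 2, 3, 5, 8, None]
results = []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results.append((depth, train_mse, test_mse))
    print(f"max_depth={str(depth):>4}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error keeps falling as depth grows, while test error typically falls and then rises again. The depth with the lowest test (or cross-validated) error is the usual choice.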
Resources
Data Science topic PDF
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
➤ Bias is the error from wrong assumptions in the learning algorithm. High bias can cause the model to miss relevant patterns.
➤ Variance is the error from sensitivity to small fluctuations in the training set. High variance can cause overfitting.
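➤ The bias-variance trade-off is the tension between the error introduced by bias and the error introduced by variance: reducing one usually increases the other. The aim is a model whose complexity minimizes their combined error.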
➤ To reduce variance (overfitting): use more training data, regularization, or simpler models.
➤ To reduce bias (underfitting): use more complex models or more informative features.
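The last two answers can be illustrated with a minimal sketch. It assumes a hypothetical quadratic target: adding polynomial features removes the bias of a straight-line fit, and a ridge penalty shrinks the coefficients of a deliberately over-flexible degree-12 polynomial. The data, the degrees, and `alpha=1.0` are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Small noisy sample from a quadratic function (hypothetical data)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(40, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=40)

# Reducing bias: give the linear model features that can express the curve
linear = LinearRegression().fit(X, y)
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
mse_linear = mean_squared_error(y, linear.predict(X))
mse_quadratic = mean_squared_error(y, quadratic.predict(X))
print(f"Training MSE, linear features:    {mse_linear:.3f}")
print(f"Training MSE, quadratic features: {mse_quadratic:.3f}")

# Reducing variance: same over-flexible model, without and with a ridge penalty
def poly_model(regressor):
    return make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        regressor,
    )

unregularized = poly_model(LinearRegression()).fit(X, y)
regularized = poly_model(Ridge(alpha=1.0)).fit(X, y)

# The final pipeline step holds the fitted regressor and its coefficients
coef_norm_ols = np.linalg.norm(unregularized.steps[-1][1].coef_)
coef_norm_ridge = np.linalg.norm(regularized.steps[-1][1].coef_)
print(f"Coefficient norm without penalty: {coef_norm_ols:.2f}")
print(f"Coefficient norm with ridge:      {coef_norm_ridge:.2f}")
```

Smaller coefficients generally mean a smoother fit that reacts less to individual training points. How strong the penalty should be is normally chosen by cross-validation.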