k-Nearest Neighbors (k-NN)

Description

k-Nearest Neighbors (k-NN) is a simple, non-parametric, instance-based learning algorithm used for both classification and regression tasks. It makes predictions based on the closest training examples in the feature space.

How k-NN Works

When a new data point is encountered, the algorithm:

  • Calculates the distance (e.g., Euclidean) from the new point to all training points
  • Selects the k nearest neighbors
  • For classification: Predicts the majority class among the neighbors
  • For regression: Averages the values of the neighbors

k-NN is a lazy learning algorithm: it builds no explicit model during training; instead, it stores the entire training set and defers all computation to prediction time.
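
These steps translate almost line for line into code. Below is a minimal from-scratch sketch of the classification case, assuming NumPy is available; the function name knn_predict and the toy data are purely illustrative.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the neighbors' labels (classification)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two clusters labeled 0 and 1
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.4]), k=3))  # -> 0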

Examples

Python Code for k-NN Classification

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

Real-World Applications

  • Healthcare: Disease classification based on patient symptoms
  • Finance: Credit risk prediction, fraud detection
  • Recommender Systems: Suggesting products based on user similarity
  • Pattern Recognition: Handwriting and facial recognition

Resources

The following resources will be manually added later:

Video Tutorials

Interview Questions

1. What is k-Nearest Neighbors (k-NN) and how does it work?

k-NN is a supervised learning algorithm that classifies or predicts a data point based on the majority label (for classification) or average value (for regression) of its k closest neighbors in the training dataset.

2. What are the main advantages and disadvantages of k-NN?

Advantages: Simple to implement and interpret, no explicit training phase, naturally extends to multi-class problems.

Disadvantages: Computationally expensive at prediction time, memory-intensive (it stores the entire training set), sensitive to irrelevant features and feature scaling, and affected by noisy data.

3. How do you choose the value of k in k-NN?

The value of k is chosen using cross-validation. A small k can lead to overfitting, while a large k may smooth out class boundaries too much (underfitting). Odd values are often preferred to break ties in binary classification.
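
In practice this search is easy to automate. Below is a sketch using scikit-learn's cross_val_score on the iris data from the example above; the candidate range of odd k values from 1 to 19 is an arbitrary choice for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 21, 2)}  # odd values of k to avoid ties
best_k = max(scores, key=scores.get)
print(f"Best k: {best_k} (CV accuracy: {scores[best_k]:.3f})")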

4. What distance metrics can be used in k-NN?

Common metrics include:

  • Euclidean distance
  • Manhattan distance
  • Minkowski distance
  • Cosine similarity
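
In scikit-learn, the metric is selected via the metric parameter of the estimator. A brief sketch (note that "cosine" here is a distance derived from cosine similarity, and it forces a brute-force neighbor search internally):

from sklearn.neighbors import KNeighborsClassifier

# Euclidean distance is the default (Minkowski with p=2)
knn_euclidean = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

# Manhattan distance (Minkowski with p=1)
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric="manhattan")

# General Minkowski distance with a chosen power p
knn_minkowski = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=3)

# Cosine distance (1 - cosine similarity)
knn_cosine = KNeighborsClassifier(n_neighbors=3, metric="cosine")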

5. How does feature scaling impact k-NN?

Feature scaling is crucial because k-NN relies on distance metrics: features with larger numeric ranges dominate the distance calculation, so normalization or standardization (e.g., min-max scaling or z-score standardization) is typically applied before fitting.
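
A quick way to see the effect, sketched with scikit-learn's wine dataset, whose features span very different numeric ranges; exact scores will vary, but the scaled pipeline typically scores markedly higher:

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine features range from fractions to hundreds, so scaling matters for k-NN
X, y = load_wine(return_X_y=True)

unscaled = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("Without scaling:", cross_val_score(unscaled, X, y, cv=5).mean())
print("With scaling:   ", cross_val_score(scaled, X, y, cv=5).mean())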