Hierarchical Clustering

Description

Hierarchical Clustering is an unsupervised learning method that groups similar objects into clusters based on their distance or similarity. Unlike K-Means, it does not require the number of clusters to be specified upfront; instead, it builds a tree-like structure called a dendrogram that represents the nested grouping of the data.

How It Works

  • Agglomerative (Bottom-Up): Starts with each data point as a single cluster and merges the closest pairs iteratively until all points are merged into one cluster.
  • Divisive (Top-Down): Starts with all data in one cluster and splits it recursively into smaller clusters.
  • Uses a distance metric (e.g., Euclidean) to compare individual points and a linkage criterion (e.g., single, complete, average, ward) to define the distance between clusters.
  • Results are visualized using a dendrogram, which helps in deciding the number of clusters by cutting the tree at a specific height (see the sketch after this list).
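
A minimal sketch of this workflow using SciPy (the toy data and parameter values are assumed for illustration): linkage builds the merge tree bottom-up, and fcluster "cuts" it into a flat clustering.

import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Toy data: 30 points drawn from 3 blobs (assumed for illustration)
X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Build the merge tree with Euclidean distances and ward linkage
Z = sch.linkage(X, method='ward')

# Cut the dendrogram to obtain a flat clustering with exactly 3 clusters
labels = sch.fcluster(Z, t=3, criterion='maxclust')
print(labels)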

Examples

Python Example of Hierarchical Clustering

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data
X, _ = make_blobs(n_samples=150, centers=4, cluster_std=0.60, random_state=0)

# Dendrogram
plt.figure(figsize=(10, 7))
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title("Dendrogram")
plt.xlabel("Samples")
plt.ylabel("Euclidean Distances")
plt.show()

# Agglomerative clustering: the 'affinity' argument was removed in
# scikit-learn 1.4, and ward linkage implies Euclidean distances
model = AgglomerativeClustering(n_clusters=4, linkage='ward')
y_pred = model.fit_predict(X)

# Cluster Visualization
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='rainbow')
plt.title("Hierarchical Clustering")
plt.show()

Real-World Applications

Hierarchical Clustering Applications

  • Bioinformatics: Creating phylogenetic trees, gene expression data analysis
  • Social Network Analysis: Grouping users with similar behavior or interests
  • Market Research: Customer segmentation for targeted marketing
  • Document Classification: Clustering articles or research papers based on topics

Resources

The following resources will be manually added later:

Video Tutorials

Interview Questions

1. What is the difference between agglomerative and divisive hierarchical clustering?

Agglomerative: Bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: Top-down approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
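
In SciPy, the agglomerative merge sequence is recorded explicitly in the linkage matrix, so the bottom-up process can be inspected row by row. A small sketch (the 1-D points are assumed for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 1-D points; each starts as its own cluster
X = np.array([[0.0], [0.5], [5.0], [5.5], [10.0]])

Z = linkage(X, method='single')
# Each row of Z records one merge:
# [cluster_i, cluster_j, distance, size of the merged cluster];
# indices >= len(X) refer to clusters created by earlier rows
print(Z)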

2. How do you decide the number of clusters in hierarchical clustering?

By analyzing the dendrogram and selecting a threshold to cut the tree. The number of vertical lines intersected by the cut corresponds to the number of clusters.
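
Programmatically, this cut corresponds to fcluster with a distance threshold. A minimal sketch (the threshold value is assumed; in practice you would read it off the dendrogram):

import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
Z = sch.linkage(X, method='ward')

# Undo every merge above height 10; the surviving subtrees
# become the flat clusters
labels = sch.fcluster(Z, t=10.0, criterion='distance')
print("clusters found:", len(set(labels)))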

3. What is a dendrogram?

A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical clustering. It helps in visualizing the arrangement of clusters.

4. What are linkage criteria in hierarchical clustering?

Single linkage: Minimum distance between points in two clusters

Complete linkage: Maximum distance between points in two clusters

Average linkage: Average distance between all points in the two clusters

Ward linkage: Merges the pair of clusters that produces the smallest increase in total within-cluster variance (see the sketch below)
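
The first three criteria are simple statistics of the pairwise distances between two clusters, which a few lines of NumPy can make concrete (the two point sets are assumed for illustration):

import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A
b = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster B

d = cdist(a, b)               # all pairwise Euclidean distances
print("single  :", d.min())   # closest pair
print("complete:", d.max())   # farthest pair
print("average :", d.mean())  # mean over all pairs

# Ward instead scores a merge by the increase in within-cluster
# sum of squares (libraries may report a rescaled version)
na, nb = len(a), len(b)
ward_cost = na * nb / (na + nb) * np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)
print("ward    :", ward_cost)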

5. Is hierarchical clustering scalable to large datasets?

Not in its standard form. The naive algorithm needs O(n²) memory for the pairwise distance matrix and O(n³) time (roughly O(n² log n) for optimized variants), which makes it impractical for very large datasets. One common workaround is sketched below.
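
A common mitigation in scikit-learn is to pass a sparse connectivity constraint so that merges are only considered between neighboring points; the parameter values here are assumed for illustration:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

# Only allow merges along each point's 10 nearest neighbors,
# avoiding the dense n x n distance matrix
conn = kneighbors_graph(X, n_neighbors=10, include_self=False)
model = AgglomerativeClustering(n_clusters=4, linkage='ward', connectivity=conn)
labels = model.fit_predict(X)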