K-Means Clustering

Description

K-Means Clustering is an unsupervised machine learning algorithm used to partition a dataset into K distinct clusters based on similarity. It works by iteratively assigning data points to clusters and updating the cluster centers (centroids) until convergence.

How It Works

  • Choose the number of clusters (K)
  • Initialize K centroids randomly
  • Assign each data point to the nearest centroid
  • Recompute the centroids as the mean of assigned points
  • Repeat the assignment and update steps until the centroids stop changing significantly (a minimal from-scratch sketch of these steps follows below)
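
These steps map directly onto a short NumPy implementation. The sketch below is a minimal, illustrative version of the standard (Lloyd's) algorithm rather than a production implementation; the function name kmeans_sketch and its default parameters are chosen only for this example.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-4, seed=0):
    """Minimal K-Means (Lloyd's algorithm) following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids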

Examples

Python Example of K-Means Clustering

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Create sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit KMeans (n_init and random_state set so the result is reproducible)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.75)
plt.title('K-Means Clustering')
plt.show()

Real-World Applications

K-Means Clustering Applications

  • Customer Segmentation: Grouping customers based on purchasing behavior
  • Image Compression: Reducing the number of colors in an image using color quantization (see the sketch after this list)
  • Document Clustering: Grouping similar documents for topic discovery
  • Anomaly Detection: Identifying unusual data patterns by analyzing cluster outliers

Figure: Clustered customer segments
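
As a concrete illustration of the image-compression use case, the sketch below quantizes an image's colors with K-Means: pixels are treated as points in RGB space and each pixel is replaced by its cluster centroid. The filename photo.jpg and the choice of 16 colors are placeholders for this example.

import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

# Load an image and flatten it to a (num_pixels, 3) array of RGB values
# ('photo.jpg' is a placeholder filename for this example)
img = np.asarray(Image.open('photo.jpg').convert('RGB'), dtype=np.float64) / 255.0
pixels = img.reshape(-1, 3)

# Cluster pixel colors into 16 groups and replace each pixel
# with its cluster centroid (the "quantized" palette color)
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)

Image.fromarray((quantized * 255).astype(np.uint8)).save('photo_16_colors.jpg')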

Resources

The following resources will be manually added later:

Video Tutorials

Interview Questions

1. What is the main objective of the K-Means algorithm?

The main objective is to minimize the within-cluster sum of squares (WCSS), i.e., to find centroids that minimize the total squared distance between each point and the centroid of its assigned cluster.
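
In scikit-learn this quantity is exposed as the fitted model's inertia_ attribute. The brief sketch below, which assumes the X, y_kmeans, and kmeans objects from the example above, shows that it matches the WCSS computed by hand.

import numpy as np

# WCSS computed by hand: sum of squared distances from each point
# to the centroid of its assigned cluster
wcss = sum(
    np.sum((X[y_kmeans == j] - kmeans.cluster_centers_[j]) ** 2)
    for j in range(kmeans.n_clusters)
)

print(wcss, kmeans.inertia_)  # the two values should match (up to float rounding)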

2. How do you choose the value of K in K-Means?

The "elbow method" is commonly used, where WCSS is plotted for various K values. The "elbow" point (where WCSS starts to decrease slowly) indicates the optimal K.

3. Is K-Means sensitive to the initial placement of centroids?

Yes. Different initial placements can converge to different local optima and therefore different results. K-Means++ initialization (the default in scikit-learn) spreads the initial centroids apart and usually leads to better, more stable solutions.
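
The sketch below makes the initialization choice explicit and contrasts k-means++ with purely random seeding, again reusing X from the example above. Setting n_init=1 runs each variant from a single starting configuration, which makes the sensitivity to initialization easier to observe.

from sklearn.cluster import KMeans

# Explicit k-means++ seeding (the scikit-learn default)
km_pp = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=0).fit(X)

# Purely random initial centroids, for comparison
km_rand = KMeans(n_clusters=4, init='random', n_init=1, random_state=0).fit(X)

print(km_pp.inertia_, km_rand.inertia_)  # random init often ends with higher WCSS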

4. Can K-Means handle non-spherical clusters?

Not well. Because points are assigned by Euclidean distance to the nearest centroid, K-Means favors compact, roughly spherical clusters of similar size, which makes it less effective for elongated, irregular, or overlapping clusters.
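
A quick way to see this limitation is to run K-Means on the two interleaving half-moons from sklearn.datasets.make_moons. The sketch below is illustrative only; the resulting plot typically shows each crescent split between the two clusters rather than recovered as a whole.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two crescent-shaped (non-spherical) clusters
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means cuts the plane into two roughly circular regions,
# so each crescent typically ends up split between the clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels, s=50, cmap='viridis')
plt.title('K-Means on non-spherical (moon-shaped) clusters')
plt.show()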

5. What is the computational complexity of K-Means?

The time complexity is O(n * k * i * d), where n = number of data points, k = number of clusters, i = number of iterations, d = dimensionality.
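
The n * k * d factor comes from the assignment step: each iteration computes the distance from every point to every centroid. The sketch below illustrates that per-iteration cost in terms of array shapes; the sizes n, k, and d are arbitrary values chosen for this example.

import numpy as np

n, k, d = 10_000, 8, 16          # points, clusters, dimensions (arbitrary sizes)
X_demo = np.random.rand(n, d)
centroids = np.random.rand(k, d)

# One assignment step: an (n, k) matrix of point-to-centroid distances,
# built from n * k * d elementwise operations
distances = np.linalg.norm(X_demo[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)   # picking the nearest centroid is O(n * k)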