Chapter 2 · Unsupervised

Clustering & Dimensionality Reduction

Unsupervised learning finds structure without labels. Clustering algorithms group similar points together; dimensionality reduction techniques project high-dimensional data to 2D or 3D while preserving structure. Both are essential for data exploration, visualization, and preprocessing.


K-Means Clustering

Unsupervised · Partitional · O(n·k·d·iter)

K-Means partitions $n$ points into $k$ clusters by minimizing the within-cluster sum of squared distances (inertia). It alternates between assigning each point to its nearest centroid and recomputing centroids until convergence.

Objective (Inertia)
$$\mathcal{J} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2$$

$\mu_j$ is the centroid of cluster $C_j$. We minimize $\mathcal{J}$ by iterating: (1) assign each $x_i$ to the nearest $\mu_j$, (2) update $\mu_j = \frac{1}{|C_j|}\sum_{x_i \in C_j} x_i$.

💡
K-Means++ Initialization

Standard K-Means uses random centroid initialization, which can converge to poor local optima. K-Means++ (Arthur & Vassilvitskii, 2007) chooses each initial centroid with probability proportional to its squared distance from the nearest centroid chosen so far, which guarantees an expected inertia within an $O(\log k)$ factor of optimal. scikit-learn uses K-Means++ by default.

Python
import numpy as np

def kmeans(X, k, max_iter=300, tol=1e-4):
    """K-Means++ initialization + Lloyd's algorithm."""
    n, d = X.shape
    # K-Means++ initialization
    centers = [X[np.random.randint(n)]]
    for _ in range(k - 1):
        dists = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        probs = dists / dists.sum()
        centers.append(X[np.random.choice(n, p=probs)])
    centers = np.array(centers)

    for _ in range(max_iter):
        # E-step: assign clusters
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = np.argmin(dists, axis=1)
        # M-step: update centroids
        # Keep the old centroid if a cluster empties out (avoids NaN means)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

DBSCAN

Unsupervised · Density-Based · O(n log n)

Density-Based Spatial Clustering of Applications with Noise groups together points that are closely packed, marking sparse regions as noise. Unlike K-Means, DBSCAN doesn't require you to specify the number of clusters and can find arbitrarily-shaped clusters.

Core Point Condition
$$|N_\varepsilon(p)| \geq \text{minPts}$$

Point $p$ is a core point if at least minPts points (including $p$ itself) lie within its $\varepsilon$-neighborhood $N_\varepsilon(p) = \{q : \|p - q\| \leq \varepsilon\}$. Border points are not core but lie within some core point's neighborhood; noise points are neither core nor reachable from a core point.
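These definitions translate directly into a compact implementation. The sketch below uses a brute-force distance matrix for clarity, so it runs in $O(n^2)$ rather than the $O(n \log n)$ achievable with a spatial index such as a k-d tree; function and variable names are illustrative.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Naive DBSCAN: brute-force neighborhoods, BFS cluster expansion."""
    n = len(X)
    # Pairwise distances; a k-d tree would avoid the O(n^2) cost
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)   # -1 = unvisited, and ultimately noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster outward from core point i
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # border or core point joins
                if core[j]:               # only core points keep expanding
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```

As in the original algorithm, a border point reachable from two clusters is claimed by whichever cluster reaches it first, so labels can depend on scan order.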

| Algorithm | Shape | Noise Handling | k Required | Scale |
|---|---|---|---|---|
| K-Means | Convex only | No | Yes | Large |
| DBSCAN | Arbitrary | Yes | No | Medium |
| Hierarchical | Arbitrary | Partial | Post-hoc | Small |
| GMM | Ellipsoidal | Soft assign | Yes | Medium |

Principal Component Analysis

Reduction · Linear · O(n·d² + d³)

PCA finds the orthogonal directions (principal components) of maximum variance in the data. By projecting onto the top $k$ components, it reduces dimensionality while preserving as much variance as possible. It's the workhorse of feature preprocessing and visualization.

PCA via Eigendecomposition
$$\Sigma = \frac{1}{n-1} X^TX \quad \text{(covariance)}$$ $$\Sigma = V \Lambda V^T \quad \text{(eigendecomposition)}$$ $$Z = XV_k \quad \text{(project onto top-}k\text{ eigenvectors)}$$

$V_k$ contains the top-$k$ eigenvectors. In practice, PCA is computed via the SVD of the centered data, $X_c = U S V^\top$, which is numerically more stable than forming the covariance matrix explicitly: the principal directions are the columns of $V$, and the explained variances are $S_i^2 / (n-1)$.
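The equivalence of the two routes is easy to verify numerically. A quick check on illustrative random data (eigendecomposition of the covariance vs. SVD of the centered matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_c = X - X.mean(axis=0)
n = len(X)

# Route 1: eigendecomposition of the covariance matrix
cov = X_c.T @ X_c / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)

# Same variances; same directions up to sign flips per component
assert np.allclose(eigvals, S**2 / (n - 1))
assert np.allclose(np.abs(eigvecs), np.abs(Vt.T))
```

Eigenvectors are only defined up to sign, hence the comparison of absolute values.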

Python
import numpy as np

def pca(X, k):
    """PCA via SVD. Returns projected data and explained variance ratio."""
    X_c = X - X.mean(axis=0)               # center
    U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
    explained = (S**2) / (S**2).sum()
    Z = X_c @ Vt[:k].T                   # project
    return Z, explained[:k], Vt[:k]       # scores, variance, components

t-SNE

Reduction · Non-Linear · van der Maaten & Hinton, 2008

t-SNE (t-distributed Stochastic Neighbor Embedding) is the gold standard for visualizing high-dimensional data in 2D. Unlike PCA, it preserves local neighborhood structure rather than global variance — clusters that are tightly packed in high dimensions appear tightly packed in the 2D projection.

t-SNE Cost (Kullback-Leibler Divergence)
$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i}\exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$ $$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|y_k - y_l\|^2)^{-1}}$$ $$\mathcal{L} = KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Each bandwidth $\sigma_i$ is tuned so that the perplexity of $p_{\cdot|i}$ matches a user-set value (typically 5–50). The heavy-tailed Student-t distribution for $q$ (vs. Gaussian for $p$) alleviates the "crowding problem": points at moderate distances can be separated more clearly in 2D.
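The low-dimensional affinities are cheap to compute with the Student-t kernel. A minimal sketch (the full algorithm also needs the perplexity-calibrated $p_{ij}$ and gradient descent on $\mathcal{L}$; the function name is illustrative):

```python
import numpy as np

def student_t_affinities(Y):
    """q_ij from an embedding Y: heavy-tailed Student-t (1 dof) kernel."""
    D2 = np.sum((Y[:, None] - Y[None]) ** 2, axis=-1)   # pairwise squared dists
    inv = 1.0 / (1.0 + D2)
    np.fill_diagonal(inv, 0.0)    # q_ii is excluded from the normalization
    return inv / inv.sum()        # normalize over all pairs i != j
```

Compared with a Gaussian kernel, the $(1 + d^2)^{-1}$ tail decays only polynomially, so moderately distant pairs retain non-negligible $q_{ij}$ — exactly the property that relieves crowding.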

⚠️
t-SNE Pitfalls

t-SNE is not deterministic, cluster sizes and inter-cluster distances in the embedding are not meaningful, and the exact algorithm is $O(n^2)$ per iteration (Barnes-Hut approximation brings this to $O(n \log n)$). Prefer UMAP for large datasets or when you need a stable, more globally faithful embedding.

UMAP

Reduction · Non-Linear · McInnes et al., 2018

UMAP (Uniform Manifold Approximation and Projection) is the modern successor to t-SNE. It's built on Riemannian geometry and algebraic topology, preserves both local and global structure, scales empirically as roughly $O(n^{1.14})$, and supports out-of-sample transformation — making it suitable for production ML pipelines.

| | PCA | t-SNE | UMAP |
|---|---|---|---|
| Method | Linear | Non-linear | Non-linear |
| Preserves | Global variance | Local structure | Local + global |
| Speed | Very fast | Slow $O(n^2)$ | Fast $\sim O(n^{1.14})$ |
| Deterministic | Yes | No | With seed |
| Out-of-sample | Yes | No | Yes |
| Inverse transform | Yes | No | Approximate |
🚀
UMAP in Production

UMAP is used at scale in genomics (single-cell RNA-seq visualization), NLP (embedding exploration), and recommendation systems. Paired with HDBSCAN for clustering, it's one of the most powerful unsupervised analysis pipelines available.