Chapter 2 · Unsupervised

Clustering & Dimensionality Reduction

Unsupervised learning finds structure without labels. Clustering algorithms group similar points together; dimensionality reduction techniques project high-dimensional data to 2D or 3D while preserving structure. Both are essential for data exploration, visualization, and preprocessing.


K-Means Clustering

Unsupervised · Partitional · O(n·k·d·iter)

K-Means partitions $n$ points into $k$ clusters by minimizing the within-cluster sum of squared distances (inertia). It alternates between assigning each point to its nearest centroid and recomputing centroids until convergence.

Objective (Inertia)
$$\mathcal{J} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2$$

$\mu_j$ is the centroid of cluster $C_j$. We minimize $\mathcal{J}$ by iterating: (1) assign each $x_i$ to the nearest $\mu_j$, (2) update $\mu_j = \frac{1}{|C_j|}\sum_{x_i \in C_j} x_i$.

💡
K-Means++ Initialization

Standard K-Means uses random centroid initialization, which can converge to poor local optima. K-Means++ (Arthur & Vassilvitskii, 2007) chooses each initial centroid with probability proportional to its squared distance from the nearest centroid chosen so far, which guarantees an expected inertia within an $O(\log k)$ factor of optimal. scikit-learn uses K-Means++ by default.

Python
import numpy as np

def kmeans(X, k, max_iter=300, tol=1e-4):
    """K-Means++ initialization + Lloyd's algorithm."""
    n, d = X.shape
    # K-Means++ initialization
    centers = [X[np.random.randint(n)]]
    for _ in range(k - 1):
        dists = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        probs = dists / dists.sum()
        centers.append(X[np.random.choice(n, p=probs)])
    centers = np.array(centers)

    for _ in range(max_iter):
        # E-step: assign clusters
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = np.argmin(dists, axis=1)
        # M-step: update centroids
        # Keep the old centroid if a cluster empties out (avoids NaN means)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

DBSCAN

Unsupervised · Density-Based · O(n log n)

Density-Based Spatial Clustering of Applications with Noise groups together points that are closely packed, marking sparse regions as noise. Unlike K-Means, DBSCAN doesn't require you to specify the number of clusters and can find arbitrarily-shaped clusters.

Core Point Condition
$$|N_\varepsilon(p)| \geq \text{minPts}$$

Point $p$ is a core point if at least minPts points (including $p$ itself) lie within its $\varepsilon$-neighborhood $N_\varepsilon(p) = \{q : \|p - q\| \leq \varepsilon\}$. Border points are not core but lie within some core point's neighborhood; noise points are neither core nor reachable from a core point.
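These definitions translate directly into a compact implementation. The sketch below uses a brute-force distance matrix for clarity, so it runs in $O(n^2)$ rather than the $O(n \log n)$ achievable with a spatial index such as a k-d tree; function and variable names are illustrative.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Naive DBSCAN: brute-force neighborhoods, BFS cluster expansion."""
    n = len(X)
    # Pairwise distances; a k-d tree would avoid the O(n^2) cost
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)   # -1 = unvisited, and ultimately noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster outward from core point i
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # border or core point joins
                if core[j]:               # only core points keep expanding
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```

As in the original algorithm, a border point reachable from two clusters is claimed by whichever cluster reaches it first, so labels can depend on scan order.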

| Algorithm | Shape | Noise Handling | k Required | Scale |
|---|---|---|---|---|
| K-Means | Convex only | No | Yes | Large |
| DBSCAN | Arbitrary | Yes | No | Medium |
| Hierarchical | Arbitrary | Partial | Post-hoc | Small |
| GMM | Ellipsoidal | Soft assign | Yes | Medium |

Principal Component Analysis

Reduction · Linear · O(n·d² + d³)

PCA finds the orthogonal directions (principal components) of maximum variance in the data. By projecting onto the top $k$ components, it reduces dimensionality while preserving as much variance as possible. It's the workhorse of feature preprocessing and visualization.

PCA via Eigendecomposition
$$\Sigma = \frac{1}{n-1} X^TX \quad \text{(covariance)}$$ $$\Sigma = V \Lambda V^T \quad \text{(eigendecomposition)}$$ $$Z = XV_k \quad \text{(project onto top-}k\text{ eigenvectors)}$$

$V_k$ contains the top-$k$ eigenvectors. In practice, PCA is computed via the SVD of the centered data, $X_c = U S V^\top$, which is numerically more stable than forming the covariance matrix explicitly: the principal directions are the columns of $V$, and the explained variances are $S_i^2 / (n-1)$.
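The equivalence of the two routes is easy to verify numerically. A quick check on illustrative random data (eigendecomposition of the covariance vs. SVD of the centered matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_c = X - X.mean(axis=0)
n = len(X)

# Route 1: eigendecomposition of the covariance matrix
cov = X_c.T @ X_c / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)

# Same variances; same directions up to sign flips per component
assert np.allclose(eigvals, S**2 / (n - 1))
assert np.allclose(np.abs(eigvecs), np.abs(Vt.T))
```

Eigenvectors are only defined up to sign, hence the comparison of absolute values.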

Python
import numpy as np

def pca(X, k):
    """PCA via SVD. Returns projected data and explained variance ratio."""
    X_c = X - X.mean(axis=0)               # center
    U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
    explained = (S**2) / (S**2).sum()
    Z = X_c @ Vt[:k].T                   # project
    return Z, explained[:k], Vt[:k]       # scores, variance, components

t-SNE

Reduction · Non-Linear · van der Maaten & Hinton, 2008

t-SNE (t-distributed Stochastic Neighbor Embedding) is the gold standard for visualizing high-dimensional data in 2D. Unlike PCA, it preserves local neighborhood structure rather than global variance — clusters that are tightly packed in high dimensions appear tightly packed in the 2D projection.

t-SNE Cost (Kullback-Leibler Divergence)
$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i}\exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$ $$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|y_k - y_l\|^2)^{-1}}$$ $$\mathcal{L} = KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Each bandwidth $\sigma_i$ is tuned so that the perplexity of $p_{\cdot|i}$ matches a user-set value (typically 5–50). The heavy-tailed Student-t distribution for $q$ (vs. Gaussian for $p$) alleviates the "crowding problem": points at moderate distances can be separated more clearly in 2D.
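The low-dimensional affinities are cheap to compute with the Student-t kernel. A minimal sketch (the full algorithm also needs the perplexity-calibrated $p_{ij}$ and gradient descent on $\mathcal{L}$; the function name is illustrative):

```python
import numpy as np

def student_t_affinities(Y):
    """q_ij from an embedding Y: heavy-tailed Student-t (1 dof) kernel."""
    D2 = np.sum((Y[:, None] - Y[None]) ** 2, axis=-1)   # pairwise squared dists
    inv = 1.0 / (1.0 + D2)
    np.fill_diagonal(inv, 0.0)    # q_ii is excluded from the normalization
    return inv / inv.sum()        # normalize over all pairs i != j
```

Compared with a Gaussian kernel, the $(1 + d^2)^{-1}$ tail decays only polynomially, so moderately distant pairs retain non-negligible $q_{ij}$ — exactly the property that relieves crowding.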

⚠️
t-SNE Pitfalls

t-SNE is not deterministic, cluster sizes and inter-cluster distances in the embedding are not meaningful, and the exact algorithm is $O(n^2)$ per iteration (Barnes-Hut approximation brings this to $O(n \log n)$). Prefer UMAP for large datasets or when you need a stable, more globally faithful embedding.

UMAP

Reduction · Non-Linear · McInnes et al., 2018

UMAP (Uniform Manifold Approximation and Projection) is the modern successor to t-SNE. It's built on Riemannian geometry and algebraic topology, preserves both local and global structure, scales empirically as roughly $O(n^{1.14})$, and supports out-of-sample transformation — making it suitable for production ML pipelines.

| | PCA | t-SNE | UMAP |
|---|---|---|---|
| Method | Linear | Non-linear | Non-linear |
| Preserves | Global variance | Local structure | Local + global |
| Speed | Very fast | Slow $O(n^2)$ | Fast $\sim O(n^{1.14})$ |
| Deterministic | Yes | No | With seed |
| Out-of-sample | Yes | No | Yes |
| Inverse transform | Yes | No | Approximate |
🚀
UMAP in Production

UMAP is used at scale in genomics (single-cell RNA-seq visualization), NLP (embedding exploration), and recommendation systems. Paired with HDBSCAN for clustering, it's one of the most powerful unsupervised analysis pipelines available.