K-Means Clustering
K-Means partitions $n$ points into $k$ clusters by minimizing the within-cluster sum of squared distances (inertia). It alternates between assigning each point to its nearest centroid and recomputing centroids until convergence.
The objective is the inertia $\mathcal{J} = \sum_{j=1}^{k}\sum_{x_i \in C_j} \|x_i - \mu_j\|^2$, where $\mu_j$ is the centroid of cluster $C_j$. We minimize $\mathcal{J}$ by iterating: (1) assign each $x_i$ to the nearest $\mu_j$, (2) update $\mu_j = \frac{1}{|C_j|}\sum_{x_i \in C_j} x_i$. Each step can only decrease $\mathcal{J}$, so the algorithm converges, though only to a local minimum.
K-Means++ Initialization
Standard K-Means uses random centroid initialization, which can converge to poor local minima. K-Means++ (Arthur & Vassilvitskii, 2007) chooses each initial centroid with probability proportional to its squared distance from the nearest centroid already chosen, which guarantees an expected inertia within an $O(\log k)$ factor of the optimum. scikit-learn uses K-Means++ by default.
```python
import numpy as np

def kmeans(X, k, max_iter=300, tol=1e-4):
    """K-Means++ initialization + Lloyd's algorithm."""
    n, d = X.shape
    # K-Means++ initialization: first center uniform, rest D^2-weighted
    centers = [X[np.random.randint(n)]]
    for _ in range(k - 1):
        dists = np.min([np.sum((X - c)**2, axis=1) for c in centers], axis=0)
        probs = dists / dists.sum()
        centers.append(X[np.random.choice(n, p=probs)])
    centers = np.array(centers)
    for _ in range(max_iter):
        # E-step: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = np.argmin(dists, axis=1)
        # M-step: update centroids (keep the old centroid if a cluster empties)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers
```
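For production use, scikit-learn's `KMeans` offers the same algorithm with K-Means++ seeding by default plus multiple restarts. A quick sanity check on synthetic blobs (the data here is made up for illustration, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated Gaussian blobs
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])

# n_init restarts keep the best (lowest-inertia) run
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (3, 2)
```

With blobs this well separated, every restart finds essentially the same partition; `n_init` matters more on overlapping or high-dimensional data.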
DBSCAN
Density-Based Spatial Clustering of Applications with Noise groups together points that are closely packed, marking sparse regions as noise. Unlike K-Means, DBSCAN doesn't require you to specify the number of clusters and can find arbitrarily-shaped clusters.
Point $p$ is a core point if at least minPts points lie within $\varepsilon$-neighborhood $N_\varepsilon(p) = \{q : \|p - q\| \leq \varepsilon\}$. Border points are reachable from a core; noise points are neither.
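The core/border/noise definitions above translate directly into a cluster-expansion loop. A naive $O(n^2)$ sketch (function name and structure are illustrative; real implementations such as scikit-learn's use a spatial index for the neighborhood queries):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: label -1 = noise, 0..k-1 = clusters."""
    n = len(X)
    # precompute eps-neighborhoods (O(n^2); use a KD-tree in practice)
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    neighbors = [np.where(d[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue  # not a core point (may later be claimed as a border point)
        # grow a new cluster from core point i by density-reachability
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])  # j is core: keep expanding
        cluster += 1
    return labels
```

Note that only core points propagate the expansion; border points join a cluster but never extend it, and anything left at -1 is noise.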
| Algorithm | Cluster shape | Noise handling | k required | Dataset size |
|---|---|---|---|---|
| K-Means | Convex only | No | Yes | Large |
| DBSCAN | Arbitrary | Yes | No | Medium |
| Hierarchical | Arbitrary | Partial | Chosen post hoc | Small |
| GMM | Ellipsoidal | Soft assignment | Yes | Medium |
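The "convex only" vs "arbitrary" distinction in the table is easy to demonstrate on the classic two-moons dataset, where K-Means cuts the crescents with a straight boundary while DBSCAN recovers them (a sketch assuming scikit-learn is available; `eps=0.3` is a hand-picked value for this data scale):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# two interleaved crescents: non-convex clusters
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN labels: cluster ids >= 0, noise = -1
print(len(set(db_labels) - {-1}))  # number of clusters DBSCAN found
```

Scoring both labelings against `y` (e.g. with adjusted Rand index) makes the gap explicit: DBSCAN matches the true crescents almost perfectly here.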
Principal Component Analysis
PCA finds the orthogonal directions (principal components) of maximum variance in the data. By projecting onto the top $k$ components, it reduces dimensionality while preserving as much variance as possible. It's the workhorse of feature preprocessing and visualization.
$V_k$ contains the top-$k$ eigenvectors of the covariance matrix of the centered data $X_c$, and the projection is $Z = X_c V_k$. In practice, PCA is computed via the SVD $X_c = U\Sigma V^T$, which is numerically more stable; the principal components are the columns of $V$, and the variance explained by component $i$ is $\sigma_i^2 / \sum_j \sigma_j^2$.
```python
import numpy as np

def pca(X, k):
    """PCA via SVD. Returns projected data and explained variance ratio."""
    X_c = X - X.mean(axis=0)  # center
    U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
    explained = (S**2) / (S**2).sum()
    Z = X_c @ Vt[:k].T  # project
    return Z, explained[:k], Vt[:k]  # scores, variance, components
```
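As a check that the SVD route really matches the covariance-eigendecomposition definition, the two can be computed side by side on synthetic data (self-contained sketch; the components agree only up to sign, since eigenvectors are sign-ambiguous):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_c = X - X.mean(axis=0)

# route 1: eigenvectors of the sample covariance matrix
cov = X_c.T @ X_c / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalue order
eigvecs = eigvecs[:, ::-1]               # reorder: top components first

# route 2: right singular vectors of the centered data
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)

# same subspace, component by component, up to sign
for i in range(5):
    assert np.allclose(np.abs(eigvecs[:, i]), np.abs(Vt[i]), atol=1e-6)
```

The singular values also relate directly to the eigenvalues: $\sigma_i^2 / (n-1)$ equals the $i$-th covariance eigenvalue, which is why the explained-variance ratio can be read off $\Sigma$ alone.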
t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) is one of the most widely used techniques for visualizing high-dimensional data in 2D. Unlike PCA, it preserves local neighborhood structure rather than global variance: points that are near neighbors in the original space tend to remain near neighbors in the 2D projection.
t-SNE converts pairwise distances into similarity distributions: Gaussian-based $p_{ij}$ in the original space and Student-t-based $q_{ij} \propto (1 + \|y_i - y_j\|^2)^{-1}$ in the embedding, then minimizes the KL divergence $\mathrm{KL}(P \,\|\, Q)$ by gradient descent on the embedding coordinates $y_i$. The heavy-tailed Student-t distribution for $q$ (vs Gaussian for $p$) alleviates the "crowding problem": points at moderate distances can be separated more clearly in 2D.
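The two similarity distributions can be sketched in a few lines of numpy. This is a toy version assuming a single global bandwidth `sigma` (real t-SNE fits a per-point bandwidth via a perplexity target and symmetrizes conditional probabilities):

```python
import numpy as np

def tsne_similarities(X, Y, sigma=1.0):
    """Toy p/q matrices: Gaussian affinities in the original space X,
    Student-t (1 dof) affinities in the low-dim embedding Y."""
    def sqdist(A):
        s = np.sum(A**2, axis=1)
        return s[:, None] + s[None, :] - 2 * A @ A.T

    # Gaussian affinities, self-similarity excluded, normalized to sum to 1
    P = np.exp(-sqdist(X) / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()

    # heavy-tailed Student-t affinities in the embedding
    Q = 1.0 / (1.0 + sqdist(Y))
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    return P, Q
```

Because $q$ decays polynomially rather than exponentially, a pair can keep a non-negligible $q_{ij}$ while sitting far apart in 2D, which is exactly what lets moderately distant neighborhoods spread out.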
t-SNE Pitfalls
t-SNE is not deterministic, cluster sizes and inter-cluster distances in the embedding are not meaningful, and it costs $O(n^2)$ without approximations (Barnes-Hut brings this to $O(n \log n)$). Consider UMAP for large datasets or when you need an embedding that better preserves global structure and can transform new points.
UMAP
UMAP (Uniform Manifold Approximation and Projection) is the modern successor to t-SNE. It is grounded in Riemannian geometry and algebraic topology, preserves both local and global structure, scales empirically as roughly $O(n^{1.14})$, and supports out-of-sample transformation of new points, making it suitable for production ML pipelines.
| | PCA | t-SNE | UMAP |
|---|---|---|---|
| Method | Linear | Non-linear | Non-linear |
| Preserves | Global variance | Local structure | Local + global |
| Speed | Very fast | Slow $O(n^2)$ | Fast $O(n^{1.14})$ |
| Deterministic | Yes | No | With seed |
| Out-of-sample | Yes | No | Yes |
| Inverse transform | Yes | No | Approximate |
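The "inverse transform" row for PCA is just the linear reconstruction $\hat{X} = Z V_k^T + \mu$, and the same projection formula applies unchanged to new points, which is why both rows say "Yes". A self-contained numpy sketch on synthetic near-rank-2 data:

```python
import numpy as np

rng = np.random.default_rng(0)
# rank-2 signal in 5 dimensions plus a little noise
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.01 * rng.normal(size=(200, 5))

mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)

k = 2
Z = (X - mu) @ Vt[:k].T      # project (out-of-sample points use the same line)
X_hat = Z @ Vt[:k] + mu      # linear inverse transform
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(err)  # small: only the noise outside the top-2 subspace is lost
```

t-SNE has no such map back to the input space, and UMAP's inverse is an approximate learned mapping rather than a closed-form projection.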
UMAP in Production
UMAP is used at scale in genomics (single-cell RNA-seq visualization), NLP (embedding exploration), and recommendation systems. Paired with HDBSCAN for clustering, it's one of the most powerful unsupervised analysis pipelines available.