Variational Autoencoders (VAE)
A VAE learns a probabilistic latent space by encoding data $x$ into a distribution $q(z|x) = \mathcal{N}(\mu(x), \sigma^2(x))$ rather than a single point. The decoder reconstructs $x$ from samples $z \sim q(z|x)$. The reparameterization trick ($z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0,I)$) makes sampling differentiable, enabling end-to-end training.
Training maximizes the ELBO: a reconstruction term plus a KL term $D_{\mathrm{KL}}(q(z|x)\,\|\,\mathcal{N}(0,I))$ that pulls the latent distribution toward the standard normal, creating a structured latent space. This enables interpolation and generation: sample $z \sim \mathcal{N}(0,I)$ and decode to get a new image.
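The two pieces above can be sketched in a few lines. This is a minimal, illustrative snippet (the function names and the toy shapes are assumptions, not a real VAE implementation):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) differentiably: z = mu + sigma * eps."""
    eps = torch.randn_like(mu)           # eps ~ N(0, I); gradients flow through mu, sigma
    return mu + (0.5 * logvar).exp() * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(dim=-1)

mu = torch.zeros(4, 16)                  # toy encoder outputs: batch of 4, 16-dim latent
logvar = torch.zeros(4, 16)
z = reparameterize(mu, logvar)           # differentiable sample, shape (4, 16)
kl = kl_to_standard_normal(mu, logvar)   # zero exactly when q(z|x) = N(0, I)
```

Note that the KL vanishes when the encoder already outputs the prior, which is why the term acts as a regularizer pulling $q(z|x)$ toward $\mathcal{N}(0,I)$.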
Generative Adversarial Networks
GANs pit two networks against each other: a Generator $G$ that creates fake samples from noise, and a Discriminator $D$ that distinguishes real from fake. They improve jointly in a minimax game: $G$ fools $D$; $D$ gets better at detecting fakes; $G$ must improve again.
At the Nash equilibrium, $G$ reproduces the data distribution and $D$ outputs $\frac{1}{2}$ everywhere. In practice, GANs are notoriously hard to train: they suffer from mode collapse and training instability, and they require careful hyperparameter tuning.
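One round of the minimax game looks like the sketch below, assuming toy 2-D data and tiny MLPs (all architectures and sizes here are illustrative; the generator uses the standard non-saturating loss rather than the raw minimax objective):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> logit
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(64, 2)        # stand-in for a batch of real samples
z = torch.randn(64, 8)           # noise input to the generator

# Discriminator step: push D(real) -> 1, D(fake) -> 0 (fakes detached)
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step (non-saturating): make D label fakes as real
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```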
Why GANs Lost to Diffusion
While GAN outputs (StyleGAN2, etc.) were sharp and high-fidelity, training instability and mode collapse remained persistent pain points. Diffusion models (2020+) surpassed GANs on FID by 2021 (Dhariwal & Nichol, "Diffusion Models Beat GANs on Image Synthesis"), and offer stable training, a tractable likelihood bound (ELBO), and better coverage of the data distribution. GANs are now mainly used in super-resolution and video generation.
Diffusion Models (DDPM)
Diffusion models define a forward process that gradually adds Gaussian noise to data over $T$ steps until it becomes pure noise, then learn to reverse this process — denoising noise back into a sample. This is what powers Stable Diffusion, DALL-E 3, Imagen, and Sora.
The network $\epsilon_\theta$ predicts the noise $\epsilon$ added at each step. At inference, starting from $x_T \sim \mathcal{N}(0,I)$, apply $T$ denoising steps to get $x_0$.
```python
import torch

class DDPM:
    def __init__(self, T=1000):
        self.T = T
        # Linear noise schedule
        self.betas = torch.linspace(1e-4, 0.02, T)
        self.alphas = 1 - self.betas
        self.alpha_bar = torch.cumprod(self.alphas, dim=0)

    def q_sample(self, x0, t, eps=None):
        """Forward: x0 → x_t (closed form)"""
        if eps is None:
            eps = torch.randn_like(x0)
        a_bar_t = self.alpha_bar[t][..., None, None, None]
        return a_bar_t.sqrt() * x0 + (1 - a_bar_t).sqrt() * eps, eps

    def p_sample(self, model, xt, t):
        """Reverse: x_t → x_{t-1} (single step)"""
        eps_pred = model(xt, t)
        a, a_bar = self.alphas[t], self.alpha_bar[t]
        mu = (1 / a.sqrt()) * (xt - (1 - a) / (1 - a_bar).sqrt() * eps_pred)
        noise = torch.randn_like(xt) if t > 0 else 0
        return mu + self.betas[t].sqrt() * noise
```
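Training pairs the closed-form forward process with a simple MSE on the predicted noise ($L_{\text{simple}}$ from the DDPM paper). A self-contained sketch, where `toy_model` is a stand-in for a real noise-prediction U-Net and the schedule mirrors the class above:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # same linear schedule as above
alpha_bar = torch.cumprod(1 - betas, dim=0)

def toy_model(xt, t):
    return torch.zeros_like(xt)                # placeholder for eps_theta(x_t, t)

x0 = torch.randn(8, 3, 32, 32)                 # a batch of "images"
t = torch.randint(0, T, (8,))                  # one random timestep per sample
eps = torch.randn_like(x0)
a_bar = alpha_bar[t][..., None, None, None]
xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # q_sample in closed form
loss = ((toy_model(xt, t) - eps) ** 2).mean()       # MSE between true and predicted noise
```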
Flow Matching
Flow Matching (FM) directly trains a velocity field $v_\theta(x, t)$ to interpolate between a noise distribution and the data distribution along straight trajectories. Unlike diffusion, which uses a stochastic SDE, FM uses a deterministic ODE, enabling far fewer function evaluations (NFE) at inference. FM was introduced by Lipman et al. at Meta AI; Stability AI's Stable Diffusion 3 is trained with it (in its rectified-flow form).
With $x_0 \sim \mathcal{N}(0,I)$ (noise) and $x_1 \sim p_{\text{data}}$ (data), define the linear path $x_t = (1-t)\,x_0 + t\,x_1$; the velocity target is just $x_1 - x_0$, a straight line. (Note the index convention flips relative to DDPM, where $x_0$ is data.) This is dramatically simpler than DDPM and needs roughly 10 inference steps vs. ~1000 for DDPM.
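The conditional flow-matching loss for this linear path fits in a few lines. A minimal sketch, with `toy_v` standing in for a learned velocity network $v_\theta(x, t)$:

```python
import torch

def toy_v(x, t):
    return torch.zeros_like(x)       # placeholder velocity field v_theta(x, t)

x0 = torch.randn(8, 2)               # noise endpoints
x1 = torch.randn(8, 2)               # stand-in for data samples
t = torch.rand(8, 1)                 # uniform times in [0, 1]
xt = (1 - t) * x0 + t * x1           # point on the straight interpolation path
target = x1 - x0                     # constant velocity along that path
loss = ((toy_v(xt, t) - target) ** 2).mean()   # regress v_theta onto x1 - x0
```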
| | DDPM | DDIM | Flow Matching | Consistency |
|---|---|---|---|---|
| Training | Noise pred. | Noise pred. | Velocity pred. | Self-distill |
| Inference steps | ~1000 | ~50 | ~10 | 1–4 |
| Sample quality | Excellent | Good | Excellent | Good |
| Used in | Original Stable Diffusion | SD v1/v2 | SD3, Flux, Sora | LCM, TCD |
| Math type | SDE | ODE approx. | ODE | ODE |
Rectified Flow & SD 3 (Recent)
Stability AI's Stable Diffusion 3 uses Rectified Flow (Liu et al., 2022), a Flow Matching variant that repeatedly "reflows" trajectories to make them straighter. Straighter ODE paths need fewer integration steps, hence faster generation. Combined with a DiT (Diffusion Transformer) backbone, SD3 achieves state-of-the-art text-to-image quality.
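Sampling integrates the ODE $\frac{dx}{dt} = v_\theta(x, t)$ from noise ($t=0$) to data ($t=1$); reflow then retrains on the resulting (noise, sample) couplings. A sketch using the simplest Euler integrator (`toy_v` is an illustrative stand-in, and the actual retraining step is only indicated in a comment):

```python
import torch

def toy_v(x, t):
    return torch.zeros_like(x)       # placeholder velocity network

def euler_sample(v, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample)."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * v(x, t)         # one Euler step along the flow
    return x

x0 = torch.randn(16, 2)              # noise endpoints
x1 = euler_sample(toy_v, x0)         # generated endpoints
# Reflow: retrain v_theta on the straight couplings (x0, x1);
# each round straightens the trajectories, allowing fewer steps.
```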
Generative Model Comparison
| Model | Likelihood | Mode Coverage | Speed | Stability |
|---|---|---|---|---|
| VAE | Explicit (ELBO) | Good | Fast (1 forward) | Stable ✓ |
| GAN | None | Mode collapse risk | Fast | Unstable ✗ |
| DDPM | ELBO | Excellent | Slow (1000 steps) | Stable ✓ |
| Flow Matching | Exact | Excellent | Fast (10 steps) | Stable ✓ |
| Normalizing Flows | Exact | Good | Medium | Stable ✓ |