Variational Autoencoders (VAE)
A VAE learns a probabilistic latent space by encoding data $x$ into a distribution $q(z|x) = \mathcal{N}(\mu(x), \sigma^2(x))$ rather than a single point. The decoder reconstructs $x$ from samples $z \sim q(z|x)$. The reparameterization trick ($z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0,I)$) makes sampling differentiable, enabling end-to-end training.
Training maximizes the ELBO: a reconstruction term plus a KL term $D_{\mathrm{KL}}(q(z|x)\,\|\,\mathcal{N}(0,I))$ that pulls the latent distribution toward the standard normal, creating a structured latent space. This enables interpolation and generation: sample $z \sim \mathcal{N}(0,I)$ and decode to get a new image.
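The two pieces above can be sketched in a few lines. This is a minimal, illustrative snippet (the function names and the toy shapes are assumptions, not a real VAE implementation):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) differentiably: z = mu + sigma * eps."""
    eps = torch.randn_like(mu)           # eps ~ N(0, I); gradients flow through mu, sigma
    return mu + (0.5 * logvar).exp() * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(dim=-1)

mu = torch.zeros(4, 16)                  # toy encoder outputs: batch of 4, 16-dim latent
logvar = torch.zeros(4, 16)
z = reparameterize(mu, logvar)           # differentiable sample, shape (4, 16)
kl = kl_to_standard_normal(mu, logvar)   # zero exactly when q(z|x) = N(0, I)
```

Note that the KL vanishes when the encoder already outputs the prior, which is why the term acts as a regularizer pulling $q(z|x)$ toward $\mathcal{N}(0,I)$.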
Generative Adversarial Networks
GANs pit two networks against each other: a Generator $G$ that creates fake samples from noise, and a Discriminator $D$ that distinguishes real from fake. They improve jointly in a minimax game: $G$ fools $D$; $D$ gets better at detecting fakes; $G$ must improve again.
At the Nash equilibrium, $G$ reproduces the data distribution and $D$ outputs $\frac{1}{2}$ everywhere. In practice, GANs are notoriously hard to train: they suffer from mode collapse and training instability, and they require careful hyperparameter tuning.
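One round of the minimax game looks like the sketch below, assuming toy 2-D data and tiny MLPs (all architectures and sizes here are illustrative; the generator uses the standard non-saturating loss rather than the raw minimax objective):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> logit
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(64, 2)        # stand-in for a batch of real samples
z = torch.randn(64, 8)           # noise input to the generator

# Discriminator step: push D(real) -> 1, D(fake) -> 0 (fakes detached)
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step (non-saturating): make D label fakes as real
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```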
Why GANs Lost to Diffusion
While GAN outputs (StyleGAN2, etc.) were sharp and high-fidelity, training instability and mode collapse remained persistent pain points. Diffusion models (2020+) surpassed GANs on FID by 2021 (Dhariwal & Nichol, "Diffusion Models Beat GANs on Image Synthesis"), and offer stable training, a tractable likelihood bound (ELBO), and better coverage of the data distribution. GANs are now mainly used in super-resolution and video generation.
Diffusion Models (DDPM)
Diffusion models define a forward process that gradually adds Gaussian noise to data over $T$ steps until it becomes pure noise, then learn to reverse this process — denoising noise back into a sample. This is what powers Stable Diffusion, DALL-E 3, Imagen, and Sora.
The network $\epsilon_\theta$ predicts the noise $\epsilon$ added at each step. At inference, starting from $x_T \sim \mathcal{N}(0,I)$, apply $T$ denoising steps to get $x_0$.
```python
import torch

class DDPM:
    def __init__(self, T=1000):
        self.T = T
        # Linear noise schedule
        self.betas = torch.linspace(1e-4, 0.02, T)
        self.alphas = 1 - self.betas
        self.alpha_bar = torch.cumprod(self.alphas, dim=0)

    def q_sample(self, x0, t, eps=None):
        """Forward: x0 → x_t (closed form)"""
        if eps is None:
            eps = torch.randn_like(x0)
        a_bar_t = self.alpha_bar[t][..., None, None, None]
        return a_bar_t.sqrt() * x0 + (1 - a_bar_t).sqrt() * eps, eps

    def p_sample(self, model, xt, t):
        """Reverse: x_t → x_{t-1} (single step)"""
        eps_pred = model(xt, t)
        a, a_bar = self.alphas[t], self.alpha_bar[t]
        mu = (1 / a.sqrt()) * (xt - (1 - a) / (1 - a_bar).sqrt() * eps_pred)
        noise = torch.randn_like(xt) if t > 0 else 0
        return mu + self.betas[t].sqrt() * noise
```
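Training pairs the closed-form forward process with a simple MSE on the predicted noise ($L_{\text{simple}}$ from the DDPM paper). A self-contained sketch, where `toy_model` is a stand-in for a real noise-prediction U-Net and the schedule mirrors the class above:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # same linear schedule as above
alpha_bar = torch.cumprod(1 - betas, dim=0)

def toy_model(xt, t):
    return torch.zeros_like(xt)                # placeholder for eps_theta(x_t, t)

x0 = torch.randn(8, 3, 32, 32)                 # a batch of "images"
t = torch.randint(0, T, (8,))                  # one random timestep per sample
eps = torch.randn_like(x0)
a_bar = alpha_bar[t][..., None, None, None]
xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # q_sample in closed form
loss = ((toy_model(xt, t) - eps) ** 2).mean()       # MSE between true and predicted noise
```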
Flow Matching
Flow Matching (FM) directly trains a velocity field $v_\theta(x, t)$ to interpolate between a noise distribution and the data distribution along straight trajectories. Unlike diffusion, which uses a stochastic SDE, FM uses a deterministic ODE, enabling far fewer function evaluations (NFE) at inference. FM was introduced by Lipman et al. at Meta AI; Stability AI's Stable Diffusion 3 is trained with it (in its rectified-flow form).
With $x_0 \sim \mathcal{N}(0,I)$ (noise) and $x_1 \sim p_{\text{data}}$ (data), define the linear path $x_t = (1-t)\,x_0 + t\,x_1$; the velocity target is just $x_1 - x_0$, a straight line. (Note the index convention flips relative to DDPM, where $x_0$ is data.) This is dramatically simpler than DDPM and needs roughly 10 inference steps vs. ~1000 for DDPM.
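The conditional flow-matching loss for this linear path fits in a few lines. A minimal sketch, with `toy_v` standing in for a learned velocity network $v_\theta(x, t)$:

```python
import torch

def toy_v(x, t):
    return torch.zeros_like(x)       # placeholder velocity field v_theta(x, t)

x0 = torch.randn(8, 2)               # noise endpoints
x1 = torch.randn(8, 2)               # stand-in for data samples
t = torch.rand(8, 1)                 # uniform times in [0, 1]
xt = (1 - t) * x0 + t * x1           # point on the straight interpolation path
target = x1 - x0                     # constant velocity along that path
loss = ((toy_v(xt, t) - target) ** 2).mean()   # regress v_theta onto x1 - x0
```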
| | DDPM | DDIM | Flow Matching | Consistency |
|---|---|---|---|---|
| Training | Noise pred. | Noise pred. | Velocity pred. | Self-distill |
| Inference steps | ~1000 | ~50 | ~10 | 1–4 |
| Sample quality | Excellent | Good | Excellent | Good |
| Used in | Original Stable Diffusion | SD v1/v2 | SD3, Flux, Sora | LCM, TCD |
| Math type | SDE | ODE approx. | ODE | ODE |
Rectified Flow & SD 3 (Recent)
Stability AI's Stable Diffusion 3 uses Rectified Flow (Liu et al., 2022), a Flow Matching variant that repeatedly "reflows" trajectories to make them straighter. Straighter ODE paths need fewer integration steps, hence faster generation. Combined with a DiT (Diffusion Transformer) backbone, SD3 achieves state-of-the-art text-to-image quality.
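Sampling integrates the ODE $\frac{dx}{dt} = v_\theta(x, t)$ from noise ($t=0$) to data ($t=1$); reflow then retrains on the resulting (noise, sample) couplings. A sketch using the simplest Euler integrator (`toy_v` is an illustrative stand-in, and the actual retraining step is only indicated in a comment):

```python
import torch

def toy_v(x, t):
    return torch.zeros_like(x)       # placeholder velocity network

def euler_sample(v, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample)."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * v(x, t)         # one Euler step along the flow
    return x

x0 = torch.randn(16, 2)              # noise endpoints
x1 = euler_sample(toy_v, x0)         # generated endpoints
# Reflow: retrain v_theta on the straight couplings (x0, x1);
# each round straightens the trajectories, allowing fewer steps.
```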
Generative Model Comparison
| Model | Likelihood | Mode Coverage | Speed | Stability |
|---|---|---|---|---|
| VAE | Explicit (ELBO) | Good | Fast (1 forward) | Stable ✓ |
| GAN | None | Mode collapse risk | Fast | Unstable ✗ |
| DDPM | ELBO | Excellent | Slow (1000 steps) | Stable ✓ |
| Flow Matching | Exact | Excellent | Fast (10 steps) | Stable ✓ |
| Normalizing Flows | Exact | Good | Medium | Stable ✓ |