Chapter 1 · Foundations

Optimization

All of machine learning reduces to one idea: find parameters that minimize a loss function. Optimization algorithms are the engines that make this possible — from simple gradient descent to adaptive methods like Adam that power modern LLMs.


Gradient Descent

Optimization · First Order · O(n·d·iter)

Gradient Descent is the most fundamental optimization algorithm. Given a differentiable loss function $\mathcal{L}(\theta)$, it iteratively moves parameters $\theta$ in the direction of steepest descent (the negative gradient), until it reaches a minimum.

Update Rule
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

Where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}$ is the gradient of the loss with respect to the parameters $\theta$.

⚡ Interactive: Gradient Descent on a Loss Landscape

The visualization shows a 2D contour plot of a loss surface with a ball rolling towards the minimum. Adjust the learning rate — too low and convergence is slow; too high and the updates overshoot.

💡
The Learning Rate Problem

A learning rate that's too small leads to slow convergence. Too large, and the parameters oscillate or diverge. This is why adaptive methods like Adam were invented.

Python
import numpy as np

def gradient_descent(loss_fn, grad_fn, theta_init, lr=0.01, epochs=1000):
    """
    Vanilla gradient descent.
    loss_fn   : callable, takes theta, returns scalar loss
                (not used by the update itself; handy for logging)
    grad_fn   : callable, takes theta, returns gradient array
    theta_init: initial parameter vector
    lr        : learning rate (step size)
    """
    theta = theta_init.astype(float, copy=True)  # in-place update needs floats
    history = [theta.copy()]

    for t in range(epochs):
        grad = grad_fn(theta)            # ∇L(θ)
        theta -= lr * grad               # θ ← θ - η·∇L
        history.append(theta.copy())

        if np.linalg.norm(grad) < 1e-6:  # stop once the gradient vanishes
            break

    return theta, history
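The learning-rate tradeoff described above is easy to see on the 1-D quadratic $\mathcal{L}(\theta) = \theta^2$, whose gradient is $2\theta$. A standalone sketch (the three rates are illustrative, not from the text):

```python
import numpy as np

def grad(theta):                  # ∇L for L(θ) = θ²
    return 2 * theta

final = {}
for lr in (0.1, 0.45, 1.1):       # small, aggressive, divergent
    theta = np.array([1.0])
    for _ in range(50):
        theta = theta - lr * grad(theta)
    final[lr] = float(theta[0])
```

Each step multiplies $\theta$ by $(1 - 2\eta)$: with $\eta = 0.1$ it shrinks steadily, with $\eta = 0.45$ it collapses almost immediately, and with $\eta = 1.1$ the factor is $-1.2$, so the iterate flips sign and grows without bound.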

Stochastic & Mini-Batch GD

Optimization · Stochastic

Vanilla GD computes the gradient over the entire dataset — expensive for large datasets. SGD updates with one random sample at a time; Mini-Batch GD is the practical sweet spot: compute gradient on a batch of $B$ samples.

Mini-Batch Update
$$\theta_{t+1} = \theta_t - \frac{\eta}{B} \sum_{i \in \mathcal{B}} \nabla_\theta \mathcal{L}(x_i, y_i; \theta_t)$$

$\mathcal{B}$ is a randomly sampled mini-batch of size $B$ from the training set. Modern LLM training runs use effective batch sizes on the order of millions of tokens.
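The shuffle-and-batch loop implied by the update rule can be sketched as follows (function and parameter names are illustrative, not from the text; `grad_fn` is expected to return the batch-averaged gradient):

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, y, lr=0.1, batch_size=32, epochs=10):
    """Mini-batch SGD sketch: reshuffle each epoch, update on each batch."""
    rng = np.random.default_rng(0)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                   # reshuffle every epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]      # indices of mini-batch B
            theta = theta - lr * grad_fn(theta, X[b], y[b])
    return theta
```

For least squares, for example, `grad_fn` would compute $\frac{2}{B} X_{\mathcal{B}}^\top (X_{\mathcal{B}}\theta - y_{\mathcal{B}})$.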

⚡ SGD vs Batch GD — Trajectory Comparison
| Method | Gradient Estimate | Speed / Epoch | Noise | Memory |
|---|---|---|---|---|
| Full GD | Exact | Slow (1 update) | None | High |
| SGD | Noisy (1 sample) | Fast (n updates) | High | Low |
| Mini-Batch GD | Approximate | Balanced | Medium | Medium |

SGD with Momentum

Optimization · First Order

Momentum accelerates SGD by accumulating a velocity vector in directions of persistent gradient. It dampens oscillations and accelerates convergence — analogous to a ball rolling downhill and building up speed in consistent directions.

Momentum Update (Polyak, 1964)
$$v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t)$$ $$\theta_{t+1} = \theta_t - \eta \, v_{t+1}$$

$\beta \approx 0.9$ is the momentum coefficient. The velocity $v_t$ is an exponentially decaying sum of past gradients.
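The two-line update translates directly into code (a minimal sketch; function and variable names are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One Polyak (heavy-ball) momentum update."""
    v = beta * v + grad           # v_{t+1} = β·v_t + ∇L(θ_t)
    theta = theta - lr * v        # θ_{t+1} = θ_t - η·v_{t+1}
    return theta, v
```

Calling this in a loop with the gradient of a quadratic bowl shows the characteristic spiral: the velocity overshoots slightly, then damps out as the gradients change sign.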

🏃
Nesterov Accelerated Gradient (NAG)

An improvement on momentum: evaluate the gradient at the lookahead position $\theta - \beta v$ before updating. This yields stronger convergence guarantees on smooth convex problems and is available in PyTorch's SGD optimizer via `nesterov=True`.
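In code, the only change from plain momentum is where the gradient is evaluated. A sketch of one common formulation (with $v$ accumulating raw gradients, the lookahead point is $\theta - \eta\beta v$; PyTorch uses an equivalent reparameterization):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.01, beta=0.9):
    """One NAG update: gradient taken at the lookahead point."""
    g = grad_fn(theta - lr * beta * v)   # ∇L at the lookahead θ - η·β·v
    v = beta * v + g
    theta = theta - lr * v
    return theta, v
```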

Adam Optimizer

Optimization · Adaptive · Kingma & Ba 2015

Adam (Adaptive Moment Estimation) is the most widely used deep learning optimizer. It combines momentum (first moment) with per-parameter adaptive learning rates (second moment), making it robust to hyperparameter choices. Most modern models — GPT, BERT, Stable Diffusion — are trained with Adam or its variant AdamW.

Adam — Kingma & Ba (2015)
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla\mathcal{L}(\theta_t) \quad \text{(1st moment)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla\mathcal{L}(\theta_t))^2 \quad \text{(2nd moment)}$$ $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(bias correction)}$$ $$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Default hyperparameters: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, $\eta=10^{-3}$.

⚡ Adam vs SGD vs Momentum — Race to Minimum
Python
class Adam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = params
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.m = [np.zeros_like(p) for p in params]  # 1st moment
        self.v = [np.zeros_like(p) for p in params]  # 2nd moment
        self.t = 0

    def step(self, grads):
        self.t += 1
        for i, (p, g) in enumerate(zip(self.params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2
            m_hat = self.m[i] / (1 - self.beta1**self.t)   # bias correction
            v_hat = self.v[i] / (1 - self.beta2**self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
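A quick sanity check of the update rule on the quadratic bowl $\mathcal{L}(\theta) = \|\theta\|^2$ (a standalone copy of the same update steps; learning rate and step count are illustrative):

```python
import numpy as np

theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    g = 2 * theta                           # ∇L(θ)
    m = b1 * m + (1 - b1) * g               # 1st moment
    v = b2 * v + (1 - b2) * g**2            # 2nd moment
    m_hat = m / (1 - b1**t)                 # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
# both coordinates end up near 0
```

Note the behavior near the optimum: because $\hat{m}_t / \sqrt{\hat{v}_t}$ normalizes the gradient magnitude, Adam takes steps of roughly size $\eta$ and settles into a small oscillation of that scale rather than converging exactly.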

⚠️
AdamW: Weight Decay Fix (2019)

Original Adam entangles L2 regularization with the adaptive gradient scaling, which weakens its regularizing effect on parameters with large gradient history. AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient update and is now the standard for training transformers (GPT-3, LLaMA, etc.).
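The decoupling amounts to a one-line change: weight decay acts on $\theta$ directly instead of being folded into the gradient. A sketch of a single step (not the paper's full pseudocode; names are illustrative):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: Adam on the raw gradient, plus decoupled decay on θ."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # decay term wd·θ is added outside the adaptive scaling
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

With a zero gradient, the update reduces to pure shrinkage $\theta \leftarrow (1 - \eta\,\lambda)\theta$, which is exactly what L2-in-the-gradient Adam fails to guarantee.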

Learning Rate Scheduling

Optimization · Hyperparameter

A fixed learning rate is rarely optimal. Schedulers dynamically adjust $\eta$ during training — commonly warming up from zero to allow stable early learning, then decaying to converge precisely. Cosine annealing is used in virtually all modern LLM training runs.

Cosine Annealing with Warm Restarts (Loshchilov & Hutter, 2017)
$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{T_{\text{cur}}}{T_{\text{max}}}\,\pi\right)\right)$$

$T_{\text{cur}}$ is the number of steps since the last restart and $T_{\text{max}}$ is the length of the current cycle; at each restart $\eta$ jumps back to $\eta_{\max}$.
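The warmup-then-cosine pattern described above can be sketched as a small helper (a hypothetical function, not a library API; for warm restarts, reset $t$ at the start of each cycle):

```python
import math

def warmup_cosine_lr(t, warmup, T_max, eta_max=1e-3, eta_min=0.0):
    """Linear warmup from 0 to eta_max, then cosine decay to eta_min at T_max."""
    if t < warmup:
        return eta_max * t / warmup                    # linear warmup
    progress = (t - warmup) / (T_max - warmup)         # plays the role of T_cur/T_max
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```

At `t = 0` the rate is 0, it peaks at `eta_max` when warmup ends, crosses the midpoint `(eta_max + eta_min) / 2` halfway through the decay, and reaches `eta_min` at `T_max`.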
⚡ Learning Rate Schedules — Comparison