# Gradient Descent
Gradient Descent is the most fundamental optimization algorithm. Given a differentiable loss function $\mathcal{L}(\theta)$, it iteratively moves the parameters $\theta$ in the direction of steepest descent (the negative gradient) until it reaches a minimum:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$$

where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}$ is the gradient of the loss with respect to the parameters $\theta$.
The visualization shows a 2D contour plot of a loss surface with a ball rolling toward the minimum. Adjusting the learning rate reveals the trade-off: too low and convergence is slow; too high and the updates overshoot.
## The Learning Rate Problem
A learning rate that's too small leads to slow convergence. Too large, and the parameters oscillate or diverge. This is why adaptive methods like Adam were invented.
```python
import numpy as np

def gradient_descent(loss_fn, grad_fn, theta_init, lr=0.01, epochs=1000):
    """
    Vanilla Gradient Descent.

    loss_fn   : callable, takes theta, returns scalar loss
    grad_fn   : callable, takes theta, returns gradient array
    theta_init: initial parameter vector
    lr        : learning rate (step size)
    epochs    : maximum number of update steps
    """
    theta = theta_init.copy()
    history = [theta.copy()]
    for t in range(epochs):
        grad = grad_fn(theta)            # ∇L(θ)
        theta -= lr * grad               # θ ← θ - η·∇L
        history.append(theta.copy())
        if np.linalg.norm(grad) < 1e-6:  # convergence
            break
    return theta, history
```
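As a quick sanity check, the function can be run on a simple quadratic bowl (the loss, starting point, and learning rate below are illustrative choices, not from the original text; `gradient_descent` is repeated so the snippet runs standalone):

```python
import numpy as np

def gradient_descent(loss_fn, grad_fn, theta_init, lr=0.01, epochs=1000):
    theta = theta_init.copy()
    history = [theta.copy()]
    for t in range(epochs):
        grad = grad_fn(theta)
        theta -= lr * grad
        history.append(theta.copy())
        if np.linalg.norm(grad) < 1e-6:
            break
    return theta, history

# Quadratic bowl L(θ) = ||θ||², minimized at θ = 0.
loss_fn = lambda th: float(th @ th)
grad_fn = lambda th: 2.0 * th

theta, history = gradient_descent(loss_fn, grad_fn,
                                  np.array([3.0, -2.0]), lr=0.1)
print(theta)  # close to [0, 0]
```

With $\eta = 0.1$ and curvature 2, each step shrinks $\theta$ by a factor of $0.8$, so convergence is geometric and the tolerance check fires well before the epoch limit.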
## Stochastic & Mini-Batch GD
Vanilla GD computes the gradient over the entire dataset, which is expensive for large datasets. SGD updates with one random sample at a time; Mini-Batch GD is the practical sweet spot: compute the gradient on a batch of $B$ samples:

$$\theta_{t+1} = \theta_t - \eta \, \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_\theta \mathcal{L}_i(\theta_t)$$

where $\mathcal{B}$ is a randomly sampled mini-batch of size $B$ from the training set. Modern LLMs use batch sizes of 1M+ tokens.
| Method | Gradient Estimate | Speed / Epoch | Noise | Memory |
|---|---|---|---|---|
| Full GD | Exact | Slow (1 update) | None | High |
| SGD | Noisy (1 sample) | Fast (n updates) | High | Low |
| Mini-Batch GD | Approximate | Balanced | Medium | Medium |
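The mini-batch loop can be sketched on a toy least-squares problem (the synthetic data, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))        # synthetic inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                       # noiseless targets

w = np.zeros(3)
B, lr = 32, 0.1
for epoch in range(200):
    perm = rng.permutation(len(X))   # reshuffle each epoch
    for start in range(0, len(X), B):
        idx = perm[start:start + B]  # mini-batch B of size B
        err = X[idx] @ w - y[idx]
        grad = 2.0 / B * X[idx].T @ err  # (1/B) Σ ∇L_i
        w -= lr * grad                   # θ ← θ - η·(1/B) Σ ∇L_i
print(w)  # approaches w_true
```

Each epoch makes $n/B$ noisy updates instead of one exact update, trading gradient precision for many more steps per pass over the data.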
## SGD with Momentum
Momentum accelerates SGD by accumulating a velocity vector along directions of persistent gradient. It dampens oscillations and accelerates convergence, analogous to a ball rolling downhill and building up speed in consistent directions:

$$v_t = \beta \, v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_t$$

$\beta \approx 0.9$ is the momentum coefficient. The velocity $v_t$ is an exponentially weighted moving average of past gradients.
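A minimal sketch of the momentum update on an illustrative quadratic (hyperparameters are assumptions; the gradient is folded directly into the velocity, as in PyTorch's convention):

```python
import numpy as np

grad_fn = lambda th: 2.0 * th  # ∇L for L(θ) = ||θ||²

theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
beta, lr = 0.9, 0.05
for t in range(500):
    g = grad_fn(theta)
    v = beta * v + g           # accumulate velocity
    theta = theta - lr * v     # θ ← θ - η·v
print(theta)  # close to [0, 0]
```

Because the velocity is a running sum of past gradients, consistent gradient directions compound while alternating ones cancel, which is exactly the damping behavior described above.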
## Nesterov Accelerated Gradient (NAG)
An improvement on momentum: evaluate the gradient at the lookahead position $\theta - \beta v$ before updating. This gives better convergence guarantees and is available in PyTorch's SGD optimizer via `nesterov=True`.
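One common formulation of the lookahead step, sketched on the same kind of illustrative quadratic (here the velocity absorbs the learning rate; exact conventions vary between libraries):

```python
import numpy as np

grad_fn = lambda th: 2.0 * th  # ∇L for L(θ) = ||θ||²

theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
beta, lr = 0.9, 0.05
for t in range(500):
    g = grad_fn(theta - beta * v)  # gradient at lookahead θ - β·v
    v = beta * v + lr * g          # velocity absorbs the step size
    theta = theta - v
print(theta)  # close to [0, 0]
```

Evaluating the gradient where the velocity is about to carry the parameters lets NAG correct course one step earlier than plain momentum.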
## Adam Optimizer
Adam (Adaptive Moment Estimation) is the most widely used deep learning optimizer. It combines momentum (first moment) with per-parameter adaptive learning rates (second moment), making it robust to hyperparameter choices. Most modern models — GPT, BERT, Stable Diffusion — are trained with Adam or its variant AdamW.
Default hyperparameters: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, $\eta=10^{-3}$.
```python
import numpy as np

class Adam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = params
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.m = [np.zeros_like(p) for p in params]  # 1st moment
        self.v = [np.zeros_like(p) for p in params]  # 2nd moment
        self.t = 0

    def step(self, grads):
        self.t += 1
        for i, (p, g) in enumerate(zip(self.params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2
            m_hat = self.m[i] / (1 - self.beta1**self.t)  # bias correction
            v_hat = self.v[i] / (1 - self.beta2**self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)  # in-place update
```
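A usage sketch of the class above on an illustrative quadratic (the class is repeated so the snippet runs standalone; the loss, learning rate, and step count are assumptions):

```python
import numpy as np

class Adam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = params
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.m = [np.zeros_like(p) for p in params]
        self.v = [np.zeros_like(p) for p in params]
        self.t = 0

    def step(self, grads):
        self.t += 1
        for i, (p, g) in enumerate(zip(self.params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

params = [np.array([3.0, -2.0])]     # single parameter tensor
opt = Adam(params, lr=0.01)
for _ in range(5000):
    grads = [2.0 * params[0]]        # ∇L for L(θ) = ||θ||²
    opt.step(grads)
print(params[0])  # near [0, 0]
```

Note that near the optimum Adam's per-parameter step is roughly $\eta$ in magnitude regardless of how small the gradient is, so it settles into a small neighborhood of the minimum rather than converging geometrically like vanilla GD.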
## AdamW: Weight Decay Fix (2019)
Original Adam incorrectly incorporates L2 regularization into the adaptive gradient, so the effective regularization strength is rescaled per parameter by the adaptive denominator. AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient update; it is now the standard for training transformers (GPT-3, LLaMA, etc.).
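The decoupling can be sketched as a single-parameter update (a minimal sketch with illustrative hyperparameters, not a reference implementation):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; returns the new (p, m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)  # Adam step on the raw gradient
    p = p - lr * weight_decay * p                # decoupled weight decay
    return p, m, v
```

In plain Adam with L2 regularization, `weight_decay * p` would instead be added to `g` before the moment updates, so the decay term gets divided by the adaptive denominator; AdamW applies it directly to the parameters with a fixed strength.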
## Learning Rate Scheduling
A fixed learning rate is rarely optimal. Schedulers adjust $\eta$ dynamically during training, commonly warming up from zero to allow stable early learning and then decaying so the model converges precisely. Cosine annealing is used in virtually all modern LLM training runs.
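A linear-warmup-plus-cosine-decay schedule can be sketched as follows (the peak rate and step counts are illustrative assumptions):

```python
import math

def lr_schedule(step, peak_lr=3e-4, warmup_steps=1000, total_steps=100_000,
                min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps           # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # decays 1 → 0
    return min_lr + (peak_lr - min_lr) * cosine
```

The schedule would typically be called once per optimizer step to set the current learning rate, peaking at `peak_lr` exactly when warmup ends and reaching `min_lr` at `total_steps`.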