Chapter 5 · Modern Deep Learning

Transformers & Attention

"Attention Is All You Need" (Vaswani et al., 2017) introduced the Transformer — now the dominant architecture for NLP, vision, code, protein folding, and more. Understanding self-attention is the key to understanding GPT, BERT, ViT, Whisper, and virtually all modern AI.


Scaled Dot-Product Attention


Self-attention allows every position in a sequence to look at (attend to) every other position. Each token produces three vectors — Query, Key, and Value — through learned linear projections. Attention scores measure query-key compatibility; values are aggregated by these scores.

Scaled Dot-Product Attention
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$ $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Division by $\sqrt{d_k}$ prevents dot products from growing too large as dimension increases, which would push softmax into saturation (near-zero gradients).
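A quick numerical check of this claim (a minimal sketch — the dimension and sample count are arbitrary): for random vectors with unit-variance components, the raw dot product has standard deviation about $\sqrt{d_k}$, while the scaled version stays near 1, keeping the softmax in a well-behaved range.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# 10,000 random query/key pairs with unit-variance components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

dots = (q * k).sum(axis=-1)      # raw dot products, std ≈ sqrt(d_k) ≈ 22.6
scaled = dots / np.sqrt(d_k)     # scaled dot products, std ≈ 1
```

With scores spread over ±20+ units, softmax puts essentially all mass on one position and its gradient vanishes; after scaling, the scores stay in a range where softmax remains differentiable in practice.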

⚡ Attention Heatmap Visualization
Python (NumPy)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, seq_len, d_k)
    mask: optional causal mask (seq_len, seq_len); 0 marks blocked positions
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (B, n, n)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)   # causal mask
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

Multi-Head Attention


Rather than performing a single attention function, Multi-Head Attention (MHA) runs $h$ attention functions in parallel — each with its own $W_Q, W_K, W_V$ projections of lower dimension $d_k = d_{model}/h$. The outputs are concatenated and projected. This allows the model to attend to different representation subspaces simultaneously.

Multi-Head Attention
$$\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)$$ $$\text{MHA}(X) = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)W_O$$

Each head uses $d_k = d_{model}/h$ dimensions. GPT-3 uses $h=96$ heads, $d_{model}=12288$. Different heads learn to attend to syntax, semantics, coreference, etc.
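The MHA equations above can be sketched in NumPy (a minimal illustration: the weight matrices are random, the function and helper names are ours, and real implementations add masking, batching, and biases).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h: number of heads."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_k): project once, then carve into h heads
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                 # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 5, 32, 4
X = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, h)   # shape (5, 32)
```

Note the trick: instead of $h$ separate $d_{model} \times d_k$ projections, a single $d_{model} \times d_{model}$ matrix is applied and its output reshaped into heads — mathematically equivalent and how frameworks implement it.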

⚡ Multi-Head Attention — Head Comparison

Transformer Block


A Transformer block wraps MHA in a residual connection plus layer normalization, followed by a position-wise Feed-Forward Network (FFN) with another residual + LN. The variant is called "Post-LN" when normalization is applied after each residual addition (as in the original paper) and "Pre-LN" when it is applied inside the residual branch, before each sublayer — modern LLMs use Pre-LN for training stability.

Transformer Block (Pre-LN variant)
$$x' = x + \text{MHA}(\text{LayerNorm}(x))$$ $$x'' = x' + \text{FFN}(\text{LayerNorm}(x'))$$ $$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

The FFN hidden dimension is typically $4 \times d_{model}$. Residual connections allow gradients to flow directly and enable very deep stacking (GPT-4 estimated ~120 layers).
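The Pre-LN equations above can be sketched in NumPy (a minimal version: the `mha` argument is any function mapping (n, d) → (n, d) — here an identity placeholder stands in for attention — and learned LayerNorm gain/bias parameters are omitted).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's features to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2

def pre_ln_block(x, mha, ffn_params):
    x = x + mha(layer_norm(x))          # x'  = x  + MHA(LN(x))
    x = x + ffn(layer_norm(x), *ffn_params)  # x'' = x' + FFN(LN(x'))
    return x

rng = np.random.default_rng(0)
n, d, d_ff = 4, 16, 64                  # d_ff = 4 * d, as in the text
x = rng.standard_normal((n, d))
params = (rng.standard_normal((d, d_ff)) * 0.1, np.zeros(d_ff),
          rng.standard_normal((d_ff, d)) * 0.1, np.zeros(d))
y = pre_ln_block(x, mha=lambda z: z, ffn_params=params)  # shape (4, 16)
```

Because each sublayer only adds to `x`, the identity path from input to output is never normalized away — this is what lets gradients flow directly through very deep stacks.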

🔬
Flash Attention (Dao et al., 2022)

Standard attention requires $O(n^2)$ memory to store the attention matrix. Flash Attention computes attention in small on-chip tiles using an online softmax, recomputing blocks in the backward pass instead of materializing the full matrix — achieving $O(n)$ memory with mathematically identical output. It enables training on sequences of 100k+ tokens. Flash Attention 2 & 3 further optimize GPU utilization.
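The core trick — an online softmax that streams over key blocks while keeping running statistics — can be illustrated for a single query row. This is a toy sketch of the algebra, not the fused GPU kernel; the function name and block size are illustrative.

```python
import numpy as np

def streaming_attention_row(q, K, V, block=64):
    """Attention output for one query vector, streaming over key blocks
    without materializing the full (n,) score row at once."""
    d_k = q.shape[-1]
    m = -np.inf                              # running max of scores
    l = 0.0                                  # running softmax denominator
    acc = np.zeros_like(V[0], dtype=float)   # running weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d_k)            # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale old statistics to new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K = rng.standard_normal((300, 16))
V = rng.standard_normal((300, 16))
q = rng.standard_normal(16)
out = streaming_attention_row(q, K, V, block=64)
```

The output matches dense softmax attention exactly — only peak memory changes, since no block's scores need to outlive its loop iteration.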

Positional Encoding


Self-attention is permutation-equivariant — without positional information, the model can't distinguish "cat sat on mat" from "mat sat on cat." Positional encodings inject sequence order into the representation.

Sinusoidal Encoding (Vaswani 2017)
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Modern LLMs instead use RoPE (Rotary Position Embedding, Su et al. 2021), which encodes position as a rotation of each query/key feature pair in the complex plane — enabling extrapolation to sequences longer than those seen during training. RoPE is used in LLaMA, Mistral, and GPT-NeoX.
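The sinusoidal formula above translates directly to NumPy (a small sketch; assumes an even $d_{model}$ so sin/cos pairs fill the matrix exactly).

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even feature indices 2i
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even columns: sin
    pe[:, 1::2] = np.cos(angles)                  # odd columns: cos
    return pe

pe = sinusoidal_pe(512, 64)   # rows = positions, columns = sin/cos channels
```

Each column pair oscillates at a different geometric frequency, so nearby positions get similar encodings while distant ones stay distinguishable — and any fixed offset corresponds to a linear transformation of the encoding.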

⚡ Sinusoidal Positional Encoding Heatmap

Modern LLM Architectures


Modern LLMs are decoder-only Transformers trained with next-token prediction (causal language modeling). Key architectural improvements since the original Transformer include GQA, RoPE, SwiGLU/GELU, RMSNorm, and extended context windows.

| Model | Params | Context | Key Innovations | Year |
|---|---|---|---|---|
| GPT-3 | 175B | 4K | Scale + few-shot | 2020 |
| LLaMA 2 | 7B–70B | 4K | Open weights, GQA | 2023 |
| Mistral 7B | 7B | 32K | Sliding window, GQA | 2023 |
| LLaMA 3 | 8B–405B | 128K | GQA, tiled attention | 2024 |
| Gemma 2 | 2B–27B | 8K | Local/global interleaved | 2024 |
| DeepSeek-V3 | 671B MoE | 128K | MLA, MoE, FP8 | 2024 |
🚀
Grouped Query Attention (GQA)

MHA requires $O(n \cdot h \cdot d_k)$ KV-cache memory per layer. In GQA (Ainslie et al. 2023), multiple query heads share a single KV head, reducing memory by $h/g$ times. LLaMA 3 70B uses 8 KV heads with 64 query heads — 8× smaller KV cache than MHA.