Chapter 5 · Modern Deep Learning

Transformers & Attention

"Attention Is All You Need" (Vaswani et al., 2017) introduced the Transformer — now the dominant architecture for NLP, vision, code, protein folding, and more. Understanding self-attention is the key to understanding GPT, BERT, ViT, Whisper, and virtually all modern AI.


Scaled Dot-Product Attention


Self-attention allows every position in a sequence to look at (attend to) every other position. Each token produces three vectors — Query, Key, and Value — through learned linear projections. Attention scores measure query-key compatibility; values are aggregated by these scores.

Scaled Dot-Product Attention
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$ $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Division by $\sqrt{d_k}$ prevents dot products from growing too large as dimension increases, which would push softmax into saturation (near-zero gradients).
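A quick numerical check of this claim (a minimal sketch — the dimension and sample count are arbitrary): for random vectors with unit-variance components, the raw dot product has standard deviation about $\sqrt{d_k}$, while the scaled version stays near 1, keeping the softmax in a well-behaved range.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# 10,000 random query/key pairs with unit-variance components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

dots = (q * k).sum(axis=-1)      # raw dot products, std ≈ sqrt(d_k) ≈ 22.6
scaled = dots / np.sqrt(d_k)     # scaled dot products, std ≈ 1
```

With scores spread over ±20+ units, softmax puts essentially all mass on one position and its gradient vanishes; after scaling, the scores stay in a range where softmax remains differentiable in practice.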

⚡ Attention Heatmap Visualization
Python (NumPy)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, seq_len, d_k)
    mask: optional causal mask (seq_len, seq_len); 0 marks blocked positions
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (B, n, n)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)   # causal mask
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

Multi-Head Attention


Rather than performing a single attention function, Multi-Head Attention (MHA) runs $h$ attention functions in parallel — each with its own $W_Q, W_K, W_V$ projections of lower dimension $d_k = d_{model}/h$. The outputs are concatenated and projected. This allows the model to attend to different representation subspaces simultaneously.

Multi-Head Attention
$$\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)$$ $$\text{MHA}(X) = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)W_O$$

Each head uses $d_k = d_{model}/h$ dimensions. GPT-3 uses $h=96$ heads, $d_{model}=12288$. Different heads learn to attend to syntax, semantics, coreference, etc.
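The MHA equations above can be sketched in NumPy (a minimal illustration: the weight matrices are random, the function and helper names are ours, and real implementations add masking, batching, and biases).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h: number of heads."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_k): project once, then carve into h heads
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                 # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 5, 32, 4
X = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, h)   # shape (5, 32)
```

Note the trick: instead of $h$ separate $d_{model} \times d_k$ projections, a single $d_{model} \times d_{model}$ matrix is applied and its output reshaped into heads — mathematically equivalent and how frameworks implement it.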

⚡ Multi-Head Attention — Head Comparison

Transformer Block


A Transformer block wraps MHA in a residual connection plus layer normalization, followed by a position-wise Feed-Forward Network (FFN) with another residual + LN. The variant is called "Post-LN" when normalization is applied after each residual addition (as in the original paper) and "Pre-LN" when it is applied inside the residual branch, before each sublayer — modern LLMs use Pre-LN for training stability.

Transformer Block (Pre-LN variant)
$$x' = x + \text{MHA}(\text{LayerNorm}(x))$$ $$x'' = x' + \text{FFN}(\text{LayerNorm}(x'))$$ $$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

The FFN hidden dimension is typically $4 \times d_{model}$. Residual connections allow gradients to flow directly and enable very deep stacking (GPT-4 estimated ~120 layers).
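The Pre-LN equations above can be sketched in NumPy (a minimal version: the `mha` argument is any function mapping (n, d) → (n, d) — here an identity placeholder stands in for attention — and learned LayerNorm gain/bias parameters are omitted).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's features to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2

def pre_ln_block(x, mha, ffn_params):
    x = x + mha(layer_norm(x))          # x'  = x  + MHA(LN(x))
    x = x + ffn(layer_norm(x), *ffn_params)  # x'' = x' + FFN(LN(x'))
    return x

rng = np.random.default_rng(0)
n, d, d_ff = 4, 16, 64                  # d_ff = 4 * d, as in the text
x = rng.standard_normal((n, d))
params = (rng.standard_normal((d, d_ff)) * 0.1, np.zeros(d_ff),
          rng.standard_normal((d_ff, d)) * 0.1, np.zeros(d))
y = pre_ln_block(x, mha=lambda z: z, ffn_params=params)  # shape (4, 16)
```

Because each sublayer only adds to `x`, the identity path from input to output is never normalized away — this is what lets gradients flow directly through very deep stacks.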

🔬
Flash Attention (Dao et al., 2022)

Standard attention requires $O(n^2)$ memory to store the attention matrix. Flash Attention computes attention in small on-chip tiles using an online softmax, recomputing blocks in the backward pass instead of materializing the full matrix — achieving $O(n)$ memory with mathematically identical output. It enables training on sequences of 100k+ tokens. Flash Attention 2 & 3 further optimize GPU utilization.
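The core trick — an online softmax that streams over key blocks while keeping running statistics — can be illustrated for a single query row. This is a toy sketch of the algebra, not the fused GPU kernel; the function name and block size are illustrative.

```python
import numpy as np

def streaming_attention_row(q, K, V, block=64):
    """Attention output for one query vector, streaming over key blocks
    without materializing the full (n,) score row at once."""
    d_k = q.shape[-1]
    m = -np.inf                              # running max of scores
    l = 0.0                                  # running softmax denominator
    acc = np.zeros_like(V[0], dtype=float)   # running weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d_k)            # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale old statistics to new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K = rng.standard_normal((300, 16))
V = rng.standard_normal((300, 16))
q = rng.standard_normal(16)
out = streaming_attention_row(q, K, V, block=64)
```

The output matches dense softmax attention exactly — only peak memory changes, since no block's scores need to outlive its loop iteration.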

Positional Encoding


Self-attention is permutation-equivariant — without positional information, the model can't distinguish "cat sat on mat" from "mat sat on cat." Positional encodings inject sequence order into the representation.

Sinusoidal Encoding (Vaswani 2017)
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Modern LLMs instead use RoPE (Rotary Position Embedding, Su et al. 2021), which encodes position as a rotation of each query/key feature pair in the complex plane — enabling extrapolation to sequences longer than those seen during training. RoPE is used in LLaMA, Mistral, and GPT-NeoX.
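The sinusoidal formula above translates directly to NumPy (a small sketch; assumes an even $d_{model}$ so sin/cos pairs fill the matrix exactly).

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even feature indices 2i
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even columns: sin
    pe[:, 1::2] = np.cos(angles)                  # odd columns: cos
    return pe

pe = sinusoidal_pe(512, 64)   # rows = positions, columns = sin/cos channels
```

Each column pair oscillates at a different geometric frequency, so nearby positions get similar encodings while distant ones stay distinguishable — and any fixed offset corresponds to a linear transformation of the encoding.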

⚡ Sinusoidal Positional Encoding Heatmap

Modern LLM Architectures


Modern LLMs are decoder-only Transformers trained with next-token prediction (causal language modeling). Key architectural improvements since the original Transformer include GQA, RoPE, SwiGLU/GELU, RMSNorm, and extended context windows.

| Model | Params | Context | Key Innovations | Year |
|---|---|---|---|---|
| GPT-3 | 175B | 4K | Scale + few-shot | 2020 |
| LLaMA 2 | 7B–70B | 4K | Open weights, GQA | 2023 |
| Mistral 7B | 7B | 32K | Sliding window, GQA | 2023 |
| LLaMA 3 | 8B–405B | 128K | GQA, tiled attention | 2024 |
| Gemma 2 | 2B–27B | 8K | Local/global interleaved | 2024 |
| DeepSeek-V3 | 671B MoE | 128K | MLA, MoE, FP8 | 2024 |
🚀
Grouped Query Attention (GQA)

MHA requires $O(n \cdot h \cdot d_k)$ KV-cache memory per layer. In GQA (Ainslie et al. 2023), multiple query heads share a single KV head, reducing memory by $h/g$ times. LLaMA 3 70B uses 8 KV heads with 64 query heads — 8× smaller KV cache than MHA.