Scaled Dot-Product Attention
Self-attention allows every position in a sequence to look at (attend to) every other position. Each token produces three vectors — Query, Key, and Value — through learned linear projections. Attention scores measure query-key compatibility; values are aggregated by these scores.
Division by $\sqrt{d_k}$ prevents dot products from growing too large as dimension increases, which would push softmax into saturation (near-zero gradients).
```python
import numpy as np

def softmax(x, axis=-1):
    # NumPy has no built-in softmax; subtract the row max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, seq_len, d_k)
    mask: optional causal mask (seq_len, seq_len); 0 marks blocked positions
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (B, n, n)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # mask out future positions
    weights = softmax(scores, axis=-1)              # rows sum to 1
    return weights @ V, weights
```
Multi-Head Attention
Rather than performing a single attention function, Multi-Head Attention (MHA) runs $h$ attention functions in parallel — each with its own $W_Q, W_K, W_V$ projections of lower dimension $d_k = d_{model}/h$. The outputs are concatenated and projected. This allows the model to attend to different representation subspaces simultaneously.
With $h = 96$ heads and $d_{model} = 12288$, GPT-3 uses $d_k = 128$ dimensions per head. Different heads learn to attend to different phenomena — syntax, semantics, coreference, and so on.
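The reshape-split-concatenate pattern above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the weight matrices and shapes are invented for the example, and the per-head projections are realized by reshaping single $d_{model} \times d_{model}$ matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_Q, W_K, W_V, W_O, h):
    """x: (batch, n, d_model); W_*: (d_model, d_model); h: number of heads."""
    B, n, d_model = x.shape
    d_k = d_model // h

    def split_heads(t):                  # (B, n, d_model) -> (B, h, n, d_k)
        return t.reshape(B, n, h, d_k).transpose(0, 2, 1, 3)

    Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (B, h, n, n)
    out = softmax(scores) @ V                        # (B, h, n, d_k)
    out = out.transpose(0, 2, 1, 3).reshape(B, n, d_model)  # concatenate heads
    return out @ W_O                                 # final output projection

B, n, d_model, h = 2, 5, 16, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((B, n, d_model))
W = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *W, h=h)
print(y.shape)  # (2, 5, 16)
```

Because all heads are computed from one batched matmul over the head axis, the $h$ attention functions really do run in parallel rather than in a loop.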
Transformer Block
A Transformer block wraps MHA in a residual connection plus layer normalization, followed by a position-wise Feed-Forward Network (FFN) with another residual + LN. The variant is called "Post-LN" when normalization is applied after the residual addition (as in the original Transformer) and "Pre-LN" when it is applied before each sublayer — modern LLMs use Pre-LN for training stability.
The FFN hidden dimension is typically $4 \times d_{model}$. Residual connections allow gradients to flow directly and enable very deep stacking (GPT-4 estimated ~120 layers).
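A Pre-LN block can be sketched as `x = x + Sublayer(LN(x))` for each sublayer. The sketch below uses a single-head attention and a GELU FFN for brevity; all names and shapes are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def self_attention(x):                       # single head, for brevity
    d = x.shape[-1]
    return softmax(x @ x.swapaxes(-2, -1) / np.sqrt(d)) @ x

def gelu(x):                                 # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pre_ln_block(x, W1, b1, W2, b2):
    """Pre-LN: normalize *before* each sublayer, add the residual after."""
    x = x + self_attention(layer_norm(x))    # attention sublayer + residual
    h = gelu(layer_norm(x) @ W1 + b1)        # FFN hidden dim = 4 * d_model
    x = x + h @ W2 + b2                      # FFN sublayer + residual
    return x

d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, d_model))
W1 = rng.standard_normal((d_model, 4 * d_model)) * 0.1
b1 = np.zeros(4 * d_model)
W2 = rng.standard_normal((4 * d_model, d_model)) * 0.1
b2 = np.zeros(d_model)
out = pre_ln_block(x, W1, b1, W2, b2)
print(out.shape)  # (2, 5, 8)
```

Note that the residual path (`x + ...`) is never normalized in Pre-LN, which is exactly what lets gradients flow directly through deep stacks.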
Flash Attention (Dao et al., 2022)
Standard attention requires $O(n^2)$ memory to store the attention matrix. Flash Attention computes attention in tiles with an online softmax, never materializing the full matrix (and recomputing tiles during the backward pass), achieving $O(n)$ extra memory with an exact, identical output. It enables training on sequences of 100k+ tokens. Flash Attention 2 and 3 further optimize GPU utilization.
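The core trick is the online softmax: process K/V in blocks while keeping a running row maximum and normalizer, so the $n \times n$ score matrix never exists in full. A minimal NumPy sketch of that recurrence (not a real fused kernel — block size and names are illustrative):

```python
import numpy as np

def streaming_attention(Q, K, V, block=2):
    """Attention over K/V blocks with a running softmax (O(n) extra memory)."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)            # running row-max of scores seen so far
    l = np.zeros(n)                    # running softmax normalizer
    for j in range(0, K.shape[0], block):
        s = Q @ K[j:j + block].T / np.sqrt(d)   # scores for this K/V block
        m_new = np.maximum(m, s.max(-1))
        scale = np.exp(m - m_new)               # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(-1)
        out = out * scale[:, None] + p @ V[j:j + block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))

# Reference: ordinary materialized softmax attention
s = Q @ K.T / np.sqrt(4)
w = np.exp(s - s.max(-1, keepdims=True))
ref = (w / w.sum(-1, keepdims=True)) @ V

print(np.allclose(streaming_attention(Q, K, V), ref))  # True
```

The rescaling by `exp(m - m_new)` is what makes the blockwise result exactly equal to the full softmax, not an approximation.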
Positional Encoding
Self-attention is permutation-equivariant — without positional information, the model can't distinguish "cat sat on mat" from "mat sat on cat." Positional encodings inject sequence order into the representation.
Modern LLMs use RoPE (Rotary Position Embedding, Su et al., 2021), which rotates pairs of query/key dimensions by position-dependent angles, making attention scores a function of relative position. This generalizes better to sequence lengths beyond those seen during training, and RoPE is used in LLaMA, Mistral, and GPT-NeoX.
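A sketch of RoPE: each dimension pair $(x_i, x_{i+d/2})$ is rotated by an angle $\theta_i = \text{pos} \cdot 10000^{-2i/d}$. This is a minimal illustration (the pairing convention follows common open-source implementations; the final check demonstrates the relative-position property):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate dim pairs of x (seq_len, d) by position-dependent angles."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)     # (d/2,) inverse frequencies
    angles = positions[:, None] * freqs[None, :]  # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]             # pair dims (i, i + d/2)
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Key property: the q.k dot product depends only on *relative* position.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
a = rope(q[None], np.array([3]))[0] @ rope(k[None], np.array([7]))[0]
b = rope(q[None], np.array([13]))[0] @ rope(k[None], np.array([17]))[0]
print(np.isclose(a, b))  # True: both pairs are 4 positions apart
```

Because rotations compose, $R(\theta p_1)q \cdot R(\theta p_2)k = q \cdot R(\theta(p_2 - p_1))k$ — the absolute positions cancel, which is why the two dot products above match.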
Modern LLM Architectures
Modern LLMs are decoder-only Transformers trained with next-token prediction (causal language modeling). Key architectural improvements since the original Transformer include GQA, RoPE, SwiGLU/GELU, RMSNorm, and extended context windows.
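Two of the listed improvements — RMSNorm and the SwiGLU FFN — can be sketched compactly (shapes and weight names here are illustrative, not any particular model's):

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    """RMSNorm: scale by the root-mean-square only (no mean subtraction, no bias)."""
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps) * g

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN: SiLU(x W_gate) gates (x W_up), then project back down."""
    silu = lambda z: z / (1 + np.exp(-z))
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d, d_ff = 8, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, d))
g = np.ones(d)
W_gate = rng.standard_normal((d, d_ff)) * 0.1
W_up = rng.standard_normal((d, d_ff)) * 0.1
W_down = rng.standard_normal((d_ff, d)) * 0.1
y = swiglu_ffn(rms_norm(x, g), W_gate, W_up, W_down)
print(y.shape)  # (2, 5, 8)
```

RMSNorm drops LayerNorm's mean subtraction and bias, saving compute; SwiGLU replaces the FFN's single activation with a learned multiplicative gate.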
| Model | Params | Context | Key Innovations | Year |
|---|---|---|---|---|
| GPT-3 | 175B | 2K | Scale + few-shot | 2020 |
| LLaMA 2 | 7B–70B | 4K | Open weights, GQA | 2023 |
| Mistral 7B | 7B | 32K | Sliding window, GQA | 2023 |
| LLaMA 3 | 8B–405B | 128K | GQA, tiled attention | 2024 |
| Gemma 2 | 2B–27B | 8K | Local/global interleaved | 2024 |
| DeepSeek-V3 | 671B MoE | 128K | MLA, MoE, FP8 | 2024 |
Grouped Query Attention (GQA)
MHA requires $O(n \cdot h \cdot d_k)$ KV-cache memory per layer. In GQA (Ainslie et al., 2023), each group of query heads shares a single KV head: with $h$ query heads and $g$ KV heads, cache memory shrinks by a factor of $h/g$. LLaMA 3 70B uses 8 KV heads for 64 query heads — an 8× smaller KV cache than MHA.
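The sharing can be sketched by repeating each KV head across its group of query heads (illustrative; real implementations broadcast rather than copy, and only $g$ KV heads are ever cached):

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (B, h, n, d_k); K, V: (B, g, n, d_k) with h divisible by g."""
    h, g = Q.shape[1], K.shape[1]
    K = np.repeat(K, h // g, axis=1)   # each KV head serves h/g query heads
    V = np.repeat(V, h // g, axis=1)
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

B, h, g, n, d_k = 1, 8, 2, 5, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((B, h, n, d_k))
K = rng.standard_normal((B, g, n, d_k))
V = rng.standard_normal((B, g, n, d_k))
out = grouped_query_attention(Q, K, V)
print(out.shape)   # (1, 8, 5, 4) — output still has all h query heads
print(h // g)      # cache is h/g = 4x smaller in this toy setting
```

Setting $g = 1$ recovers Multi-Query Attention, and $g = h$ recovers standard MHA, so GQA interpolates between the two.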