Network Architecture
A feedforward neural network (MLP) is composed of layers of neurons. Each neuron computes a weighted sum of its inputs, adds a bias, then applies a non-linear activation function. The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) guarantees that a single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy.
Each layer computes $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ and $a^{(l)} = \sigma(z^{(l)})$, where $W^{(l)} \in \mathbb{R}^{h_l \times h_{l-1}}$ are weights, $b^{(l)} \in \mathbb{R}^{h_l}$ are biases, $\sigma$ is the activation function, and $a^{(0)} = x$ is the input.
Forward Propagation
Forward propagation is simply matrix multiplication followed by an activation function, repeated for each layer. For a batch of $n$ inputs $X \in \mathbb{R}^{n \times d}$, each layer transforms the representation. The final layer produces predictions.
```python
import numpy as np

def forward(X, weights, biases, activation=np.tanh):
    """
    Multi-layer forward pass.
    weights: list of weight matrices [W1, W2, ..., WL]
    biases: list of bias vectors [b1, b2, ..., bL]
    Returns: final output and list of activations (for backprop)
    """
    activations = [X]
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        z = a @ W.T + b        # z = Wa + b
        a = activation(z)      # non-linearity
        activations.append(a)
    # Last layer (no activation for regression; softmax for multiclass)
    z_out = a @ weights[-1].T + biases[-1]
    activations.append(z_out)
    return z_out, activations
```
Backpropagation
Backpropagation is the chain rule applied recursively through the network. It computes gradients of the loss with respect to every weight in $O(n \cdot \text{params})$ time, the same asymptotic cost as a single forward pass. This efficiency unlocked the training of deep networks.
The layer error $\delta^{(l)} = \partial \mathcal{L} / \partial z^{(l)}$ propagates backwards via $\delta^{(l)} = \big(W^{(l+1)\top} \delta^{(l+1)}\big) \odot \sigma'(z^{(l)})$, yielding the weight gradients $\partial \mathcal{L} / \partial W^{(l)} = \delta^{(l)} \, a^{(l-1)\top}$. The key insight: gradients of early layers depend on all later layers via the chain rule.
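Using the same weight layout as the `forward` function above (rows of each $W$ index output units), the recursion can be sketched in NumPy for tanh hidden layers, a linear output, and a squared-error loss averaged over the batch. The function name `backward` is illustrative, not from a library:

```python
import numpy as np

def backward(X, y, weights, biases):
    """Forward + backward pass for an MLP with tanh hidden layers and a
    linear output. Returns per-layer gradients (dW, db) of the loss
    L = ||z_out - y||^2 / (2n)."""
    # Forward pass, caching pre-activations z and activations a
    a, activations, zs = X, [X], []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = a @ W.T + b
        zs.append(z)
        a = np.tanh(z)
        activations.append(a)
    z_out = a @ weights[-1].T + biases[-1]

    n = X.shape[0]
    delta = (z_out - y) / n          # dL/dz_out for the loss above
    dW = [None] * len(weights)
    db = [None] * len(weights)
    dW[-1] = delta.T @ activations[-1]
    db[-1] = delta.sum(axis=0)
    # delta_l = (delta_{l+1} W_{l+1}) * tanh'(z_l), moving towards the input
    for l in range(len(weights) - 2, -1, -1):
        delta = (delta @ weights[l + 1]) * (1.0 - np.tanh(zs[l]) ** 2)
        dW[l] = delta.T @ activations[l]
        db[l] = delta.sum(axis=0)
    return dW, db
```

Each gradient has the same shape as the parameter it corresponds to, so a gradient-descent step is just `W -= lr * dW` per layer.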
Vanishing Gradient Problem
With sigmoid or tanh activations, gradients decay exponentially through deep networks — making early layers learn extremely slowly. Solutions: ReLU activations, skip connections (ResNets), batch normalization, and careful weight initialization (He, Xavier).
Activation Functions
Without non-linear activation functions, stacked linear layers collapse into a single linear transformation. Activation functions are what give neural networks their expressive power. The choice of activation profoundly affects training dynamics.
| Activation | Formula | Range | Gradient | Used In |
|---|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | 0 or 1 | Most CNNs |
| GELU | x·Φ(x) | (-0.17, ∞) | Smooth | GPT, BERT |
| SiLU/Swish | x·σ(x) | (-0.28, ∞) | Smooth | EfficientNet, LLaMA |
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Saturates | Output (binary) |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Saturates | RNNs |
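The table's entries translate directly into NumPy. GELU is written here with its widely used tanh approximation rather than the exact Gaussian CDF:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)  # a.k.a. Swish

def gelu(x):
    # Tanh approximation of x * Phi(x), as used in many implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, f in [("relu", relu), ("gelu", gelu), ("silu", silu),
                ("sigmoid", sigmoid), ("tanh", np.tanh)]:
    print(f"{name:8s}", np.round(f(x), 3))
```

Evaluating on a dense grid confirms the table's ranges: SiLU bottoms out near $-0.28$ and GELU near $-0.17$, while ReLU is exactly zero for all negative inputs.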
Regularization
Regularization techniques prevent overfitting — the phenomenon where a model performs well on training data but poorly on unseen data. Modern neural networks have billions of parameters; without regularization they memorize training data.
$\lambda$ controls regularization strength. L2 regularization ($\lambda \sum W^2$) encourages small weights without sparsity; L1 regularization ($\lambda \sum |W|$) promotes sparsity. Dropout randomly zeroes neurons with probability $p$ at training time, acting as a form of ensemble learning.
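Both techniques fit in a few lines of NumPy. This sketch uses "inverted" dropout, which rescales surviving units by $1/(1-p)$ so that expected activation magnitudes match between training and inference; the function names are illustrative:

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 penalty lam/2 * sum of squared weights; its gradient w.r.t. each W
    is simply lam * W, i.e. weight decay."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def dropout(a, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p during training,
    rescale survivors by 1/(1-p); identity at inference time."""
    if not training or p == 0.0:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout(a, p=0.5, rng=rng)
print(dropped.mean())  # close to 1.0: the expected activation is preserved
```

The rescaling is what lets the same network be used unchanged at inference time, with `training=False`.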
Batch Normalization (Ioffe & Szegedy, 2015)
Normalizes activations within a mini-batch: $\hat{z} = \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}}$, followed by a learned scale $\gamma$ and shift $\beta$. Reduces internal covariate shift, allows higher learning rates, and acts as mild regularization. Layer Normalization (used in Transformers) normalizes across features rather than the batch dimension, which is crucial for variable-length sequences.
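The difference between the two is only the axis along which statistics are computed: BatchNorm normalizes each feature across the batch, LayerNorm normalizes each example across its features. A sketch omitting the learned $\gamma$ and $\beta$:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Per-feature statistics, computed across the batch dimension (axis 0)
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Per-example statistics, computed across the feature dimension (axis -1)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((8, 16)) * 3 + 5
print(batch_norm(x).mean(axis=0).round(6))  # ~0 for every feature
print(layer_norm(x).mean(axis=1).round(6))  # ~0 for every example
```

Because LayerNorm's statistics depend only on a single example, it behaves identically regardless of batch size or sequence length, which is why Transformers use it instead of BatchNorm.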