Network Architecture
A feedforward neural network (MLP) is composed of layers of neurons. Each neuron computes a weighted sum of its inputs, adds a bias, then applies a non-linear activation function. The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) guarantees that a single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy.
Each layer computes $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ and $a^{(l)} = \sigma(z^{(l)})$, where $W^{(l)} \in \mathbb{R}^{h_l \times h_{l-1}}$ are weights, $b^{(l)} \in \mathbb{R}^{h_l}$ are biases, $\sigma$ is the activation function, and $a^{(0)} = x$ is the input.
Forward Propagation
Forward propagation is simply matrix multiplication followed by an activation function, repeated for each layer. For a batch of $n$ inputs $X \in \mathbb{R}^{n \times d}$, each layer transforms the representation. The final layer produces predictions.
```python
import numpy as np

def forward(X, weights, biases, activation=np.tanh):
    """
    Multi-layer forward pass.
    weights: list of weight matrices [W1, W2, ..., WL]
    biases: list of bias vectors [b1, b2, ..., bL]
    Returns: final output and list of activations (for backprop)
    """
    activations = [X]
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        z = a @ W.T + b        # z = Wa + b
        a = activation(z)      # non-linearity
        activations.append(a)
    # Last layer (no activation for regression; softmax for multiclass)
    z_out = a @ weights[-1].T + biases[-1]
    activations.append(z_out)
    return z_out, activations
```
Backpropagation
Backpropagation is the chain rule applied recursively through the network. It computes gradients of the loss with respect to every weight in $O(n \cdot \text{params})$ time, the same asymptotic cost as a single forward pass. This efficiency unlocked the training of deep networks.
The layer error $\delta^{(l)} = \partial \mathcal{L} / \partial z^{(l)}$ propagates backwards via $\delta^{(l)} = \big(W^{(l+1)\top} \delta^{(l+1)}\big) \odot \sigma'(z^{(l)})$, yielding the weight gradients $\partial \mathcal{L} / \partial W^{(l)} = \delta^{(l)} \, a^{(l-1)\top}$. The key insight: gradients of early layers depend on all later layers via the chain rule.
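Using the same weight layout as the `forward` function above (rows of each $W$ index output units), the recursion can be sketched in NumPy for tanh hidden layers, a linear output, and a squared-error loss averaged over the batch. The function name `backward` is illustrative, not from a library:

```python
import numpy as np

def backward(X, y, weights, biases):
    """Forward + backward pass for an MLP with tanh hidden layers and a
    linear output. Returns per-layer gradients (dW, db) of the loss
    L = ||z_out - y||^2 / (2n)."""
    # Forward pass, caching pre-activations z and activations a
    a, activations, zs = X, [X], []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = a @ W.T + b
        zs.append(z)
        a = np.tanh(z)
        activations.append(a)
    z_out = a @ weights[-1].T + biases[-1]

    n = X.shape[0]
    delta = (z_out - y) / n          # dL/dz_out for the loss above
    dW = [None] * len(weights)
    db = [None] * len(weights)
    dW[-1] = delta.T @ activations[-1]
    db[-1] = delta.sum(axis=0)
    # delta_l = (delta_{l+1} W_{l+1}) * tanh'(z_l), moving towards the input
    for l in range(len(weights) - 2, -1, -1):
        delta = (delta @ weights[l + 1]) * (1.0 - np.tanh(zs[l]) ** 2)
        dW[l] = delta.T @ activations[l]
        db[l] = delta.sum(axis=0)
    return dW, db
```

Each gradient has the same shape as the parameter it corresponds to, so a gradient-descent step is just `W -= lr * dW` per layer.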
Vanishing Gradient Problem
With sigmoid or tanh activations, gradients decay exponentially through deep networks — making early layers learn extremely slowly. Solutions: ReLU activations, skip connections (ResNets), batch normalization, and careful weight initialization (He, Xavier).
Activation Functions
Without non-linear activation functions, stacked linear layers collapse into a single linear transformation. Activation functions are what give neural networks their expressive power. The choice of activation profoundly affects training dynamics.
| Activation | Formula | Range | Gradient | Used In |
|---|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | 0 or 1 | Most CNNs |
| GELU | x·Φ(x) | (-0.17, ∞) | Smooth | GPT, BERT |
| SiLU/Swish | x·σ(x) | (-0.28, ∞) | Smooth | EfficientNet, LLaMA |
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Saturates | Output (binary) |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Saturates | RNNs |
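The table's entries translate directly into NumPy. GELU is written here with its widely used tanh approximation rather than the exact Gaussian CDF:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)  # a.k.a. Swish

def gelu(x):
    # Tanh approximation of x * Phi(x), as used in many implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, f in [("relu", relu), ("gelu", gelu), ("silu", silu),
                ("sigmoid", sigmoid), ("tanh", np.tanh)]:
    print(f"{name:8s}", np.round(f(x), 3))
```

Evaluating on a dense grid confirms the table's ranges: SiLU bottoms out near $-0.28$ and GELU near $-0.17$, while ReLU is exactly zero for all negative inputs.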
Regularization
Regularization techniques prevent overfitting — the phenomenon where a model performs well on training data but poorly on unseen data. Modern neural networks have billions of parameters; without regularization they memorize training data.
$\lambda$ controls regularization strength. L2 regularization ($\lambda \sum W^2$) encourages small weights without sparsity; L1 regularization ($\lambda \sum |W|$) promotes sparsity. Dropout randomly zeroes neurons with probability $p$ at training time, acting as a form of ensemble learning.
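Both techniques fit in a few lines of NumPy. This sketch uses "inverted" dropout, which rescales surviving units by $1/(1-p)$ so that expected activation magnitudes match between training and inference; the function names are illustrative:

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 penalty lam/2 * sum of squared weights; its gradient w.r.t. each W
    is simply lam * W, i.e. weight decay."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def dropout(a, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p during training,
    rescale survivors by 1/(1-p); identity at inference time."""
    if not training or p == 0.0:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout(a, p=0.5, rng=rng)
print(dropped.mean())  # close to 1.0: the expected activation is preserved
```

The rescaling is what lets the same network be used unchanged at inference time, with `training=False`.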
Batch Normalization (Ioffe & Szegedy, 2015)
Normalizes activations within a mini-batch: $\hat{z} = \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}}$, followed by a learned scale $\gamma$ and shift $\beta$. Reduces internal covariate shift, allows higher learning rates, and acts as mild regularization. Layer Normalization (used in Transformers) normalizes across features rather than the batch dimension, which is crucial for variable-length sequences.
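The difference between the two is only the axis along which statistics are computed: BatchNorm normalizes each feature across the batch, LayerNorm normalizes each example across its features. A sketch omitting the learned $\gamma$ and $\beta$:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Per-feature statistics, computed across the batch dimension (axis 0)
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Per-example statistics, computed across the feature dimension (axis -1)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((8, 16)) * 3 + 5
print(batch_norm(x).mean(axis=0).round(6))  # ~0 for every feature
print(layer_norm(x).mean(axis=1).round(6))  # ~0 for every example
```

Because LayerNorm's statistics depend only on a single example, it behaves identically regardless of batch size or sequence length, which is why Transformers use it instead of BatchNorm.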