Chapter 3 · Supervised

Linear Models

Linear models are the bedrock of supervised machine learning. Despite their simplicity, they remain competitive — Logistic Regression is still a strong baseline, and SVMs excel in high dimensions. Understanding their geometry gives insight into all of modern deep learning.


Linear Regression

Regression · Supervised · O(n·d²)

Linear regression models the output as a weighted sum of input features: $\hat{y} = w^Tx + b$. It minimizes Mean Squared Error (MSE) and has a closed-form solution via the Normal Equation — making it the fastest model to fit when dimensions are moderate.

Normal Equation (Closed Form)
$$\hat{w} = (X^TX)^{-1}X^Ty$$

$X \in \mathbb{R}^{n \times d}$ is the design matrix. The Normal Equation requires $O(d^3)$ matrix inversion — for large $d$, gradient descent is preferred. Ridge Regression adds $\lambda I$ to ensure invertibility: $\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$.
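In code, the ridge-regularized Normal Equation is a few lines of NumPy. A minimal sketch (`ridge_normal_equation` is an illustrative name; `np.linalg.solve` is used instead of forming the explicit inverse, which is both faster and numerically safer):

```python
import numpy as np

def ridge_normal_equation(X, y, lam=0.0):
    """Solve (X^T X + lam*I) w = X^T y; lam=0 recovers plain least squares."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)  # O(d^3) in the dimension d
```

With `lam > 0` the system is always invertible, and the solution shrinks toward zero as `lam` grows.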

⚡ Interactive: Fit a Line with Gradient Descent
import numpy as np

class LinearRegression:
    def fit(self, X, y, lr=0.01, epochs=1000):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        for _ in range(epochs):
            y_hat = X @ self.w + self.b
            err = y_hat - y
            self.w -= lr * (2 / n) * X.T @ err   # ∂MSE/∂w
            self.b -= lr * (2 / n) * err.sum()  # ∂MSE/∂b

    def predict(self, X):
        return X @ self.w + self.b

Logistic Regression

Classification · Supervised · Probabilistic

Logistic Regression models the probability of a binary class using the sigmoid function applied to a linear score. Despite the name, it's a classification model. It minimizes binary cross-entropy (log-loss), yielding calibrated probabilities and interpretable weights.

Logistic Regression Model
$$\hat{p} = \sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}}$$ $$\mathcal{L}_{BCE} = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]$$
⚡ Interactive: Decision Boundary — Click to Add Points
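The BCE gradient takes the same form as the MSE gradient above, with the sigmoid output in place of the linear one: $\nabla_w \mathcal{L} = \frac{1}{n}X^T(\hat{p} - y)$. A minimal gradient-descent sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Minimize binary cross-entropy by full-batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # predicted probabilities
        w -= lr * X.T @ (p - y) / n     # ∂BCE/∂w
        b -= lr * (p - y).mean()        # ∂BCE/∂b
    return w, b
```

Predicting the positive class when `sigmoid(X @ w + b) > 0.5` is equivalent to thresholding the linear score at zero.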

Perceptron

Classification · Online Learning · Rosenblatt 1958

The Perceptron is the ancestor of all neural networks. It learns a linear decision boundary by updating weights only on misclassified examples. The Perceptron Convergence Theorem guarantees convergence if the data is linearly separable.

Perceptron Update Rule
$$\hat{y} = \text{sign}(w^Tx + b)$$ $$\text{if } y_i \neq \hat{y}_i: \quad w \leftarrow w + y_i x_i, \quad b \leftarrow b + y_i$$
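The update rule translates almost line-for-line into NumPy. A sketch assuming labels in $\{-1, +1\}$ (`perceptron` is an illustrative name):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Mistake-driven updates; y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                # a clean pass: converged
            break
    return w, b
```

On linearly separable data the loop terminates with zero mistakes; otherwise it cycles forever, which is why the epoch cap is needed.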
🧠
Historical Significance

Rosenblatt's Perceptron (1958) was the first algorithm that could learn from examples. Minsky & Papert's 1969 book showed it couldn't solve XOR — triggering the first "AI winter." Multi-layer perceptrons (MLPs) solved this, leading directly to modern deep learning.

Support Vector Machine

Classification · Max-Margin · Vapnik 1963

SVMs find the maximum-margin hyperplane — the decision boundary that is farthest from all data points. Support vectors are the points that lie on the margin boundary; only these affect the solution. With the kernel trick, SVMs can learn non-linear boundaries without explicitly computing high-dimensional features.

Hard-Margin SVM (Primal)
$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad\text{s.t.}\quad y_i(w^Tx_i + b) \geq 1 \;\forall i$$

Margin width $= \frac{2}{\|w\|}$. Soft-margin SVM replaces the constraint with $y_i(w^Tx_i + b) \geq 1 - \xi_i$ and adds $C\sum \xi_i$ to the objective. The dual formulation reveals that only support vectors have non-zero $\alpha_i$.

⚡ SVM Margin Visualization
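The soft-margin objective can also be minimized directly by subgradient descent on the hinge loss $\max(0, 1 - y_i(w^Tx_i + b))$, a Pegasos-style alternative to the dual QP described above (names and hyperparameters here are illustrative):

```python
import numpy as np

def svm_sgd(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + C*sum(hinge) by full-batch subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1               # points inside the margin or misclassified
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Note that only margin violators contribute to the subgradient, mirroring the dual insight that only support vectors have non-zero $\alpha_i$.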

Decision Trees

Classification · Regression · Interpretable

Decision trees partition the feature space with axis-aligned splits. At each node, they greedily choose the feature and threshold that maximally reduces impurity (Gini or entropy for classification; MSE for regression). They're the building block of Random Forests and Gradient Boosted Trees (XGBoost, LightGBM).

Gini Impurity
$$\text{Gini}(p) = 1 - \sum_{k=1}^K p_k^2$$ $$\text{Information Gain} = \text{Gini}(\mathcal{D}) - \frac{|\mathcal{D}_L|}{|\mathcal{D}|}\text{Gini}(\mathcal{D}_L) - \frac{|\mathcal{D}_R|}{|\mathcal{D}|}\text{Gini}(\mathcal{D}_R)$$
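Both quantities are straightforward to compute from label arrays. A minimal sketch (function names are illustrative):

```python
import numpy as np

def gini(labels):
    """Gini(p) = 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent, left, right):
    """Impurity reduction of a candidate left/right split, weighted by size."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)
```

A tree learner evaluates `split_gain` for every candidate feature/threshold pair at a node and keeps the best one.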
| Method | Base Learner | Combination | Strength |
|---|---|---|---|
| Decision Tree | 1 deep tree | — | Interpretable |
| Random Forest | Many shallow trees | Bagging + avg | Low variance |
| XGBoost / LightGBM | Boosted trees | Additive | State-of-art tabular |
| Extra Trees | Randomized splits | Bagging + avg | Faster, similar RF |