Linear Regression
Linear regression models the output as a weighted sum of input features: $\hat{y} = w^Tx + b$. It minimizes Mean Squared Error (MSE) and has a closed-form solution via the Normal Equation, $\hat{w} = (X^TX)^{-1}X^Ty$ — making it one of the cheapest models to fit when the number of features is moderate.
$X \in \mathbb{R}^{n \times d}$ is the design matrix. The Normal Equation requires $O(d^3)$ matrix inversion — for large $d$, gradient descent is preferred. Ridge Regression adds $\lambda I$ to ensure invertibility: $\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$.
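A minimal sketch of the Ridge closed form above. Note that in practice one solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly — same result, better numerics. The function name and the synthetic data are illustrative, not from the original text.

```python
import numpy as np

def ridge_closed_form(X, y, lam=1.0):
    """Solve (X^T X + lam*I) w = X^T y without an explicit matrix inverse."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Sanity check on synthetic data; lam=0 recovers ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = ridge_closed_form(X, y, lam=0.0)
```

With small noise, `w_hat` should land very close to `w_true`; increasing `lam` shrinks the weights toward zero.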
```python
import numpy as np

class LinearRegression:
    def fit(self, X, y, lr=0.01, epochs=1000):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        for _ in range(epochs):
            y_hat = X @ self.w + self.b
            err = y_hat - y
            self.w -= lr * (2 / n) * X.T @ err   # ∂MSE/∂w
            self.b -= lr * (2 / n) * err.sum()   # ∂MSE/∂b

    def predict(self, X):
        return X @ self.w + self.b
```
Logistic Regression
Logistic Regression models the probability of a binary class using the sigmoid function applied to a linear score. Despite the name, it's a classification model. It minimizes binary cross-entropy (log-loss), yielding calibrated probabilities and interpretable weights.
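A sketch of the idea above: gradient descent on binary cross-entropy, whose gradient with respect to $w$ works out to $\frac{1}{n}X^T(\sigma(Xw + b) - y)$. The function names and the toy dataset are illustrative assumptions, not part of the original text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimize binary cross-entropy by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)       # predicted P(y=1 | x)
        err = p - y                  # gradient of BCE w.r.t. the linear score
        w -= lr * X.T @ err / n
        b -= lr * err.mean()
    return w, b

# Toy data: the label depends only on the sign of the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w, b = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
```

Note the gradient has the same `X.T @ err` shape as linear regression — only the prediction passes through a sigmoid and the loss changes.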
Perceptron
The Perceptron is the ancestor of all neural networks. It learns a linear decision boundary by updating weights only on misclassified examples. The Perceptron Convergence Theorem guarantees convergence if the data is linearly separable.
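The update rule described above can be sketched as follows: for labels $y \in \{-1, +1\}$, a misclassified point triggers $w \leftarrow w + y_i x_i$, $b \leftarrow b + y_i$. The toy dataset here is an illustrative assumption.

```python
import numpy as np

def perceptron_fit(X, y, epochs=100):
    """Update weights only on misclassified examples; converges if the
    data is linearly separable (Perceptron Convergence Theorem)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                # a full clean pass: done
            break
    return w, b

# Linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_fit(X, y)
```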
Historical Significance
Rosenblatt's Perceptron (1958) was among the first algorithms that could learn from labeled examples. Minsky & Papert's 1969 book *Perceptrons* showed that a single-layer perceptron cannot solve XOR — contributing to the first "AI winter." Multi-layer perceptrons (MLPs) overcame this limitation, leading directly to modern deep learning.
Support Vector Machine
SVMs find the maximum-margin hyperplane — the decision boundary farthest from the nearest training points of each class. Support vectors are the points that lie on (or inside) the margin boundary; only these affect the solution. With the kernel trick, SVMs can learn non-linear boundaries without explicitly computing high-dimensional features.
Margin width $= \frac{2}{\|w\|}$. Soft-margin SVM replaces the constraint with $y_i(w^Tx_i + b) \geq 1 - \xi_i$ and adds $C\sum \xi_i$ to the objective. The dual formulation reveals that only support vectors have non-zero $\alpha_i$.
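The primal soft-margin objective above can also be optimized directly by subgradient descent on the hinge loss — a minimal sketch (the dual/kernelized solver is what libraries like `libsvm` actually use; this function and dataset are illustrative assumptions):

```python
import numpy as np

def svm_sgd(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on the primal soft-margin objective:
    0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w^T x_i + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points violating the margin
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_sgd(X, y)
```

Only margin violators contribute to the loss gradient, mirroring the dual picture: points comfortably outside the margin have $\alpha_i = 0$ and no influence.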
Decision Trees
Decision trees partition the feature space with axis-aligned splits. At each node, they greedily choose the feature and threshold that maximally reduces impurity (Gini or entropy for classification; MSE for regression). They're the building block of Random Forests and Gradient Boosted Trees (XGBoost, LightGBM).
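The greedy split search described above can be sketched for the Gini criterion: try every (feature, threshold) pair and keep the one with the lowest size-weighted child impurity. Function names and the toy data are illustrative assumptions.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search over (feature, threshold) minimizing the
    size-weighted impurity of the two child nodes."""
    n, d = X.shape
    best = (None, None, np.inf)                 # (feature, threshold, score)
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:       # candidate thresholds
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy data where feature 0 separates the classes perfectly at x <= 2.
X = np.array([[1.0, 5.0], [2.0, 4.0], [8.0, 1.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
j, t, score = best_split(X, y)
```

A real tree applies this recursively to each child until a depth limit or purity stopping rule is hit.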
| Method | Base Learner | Combination | Strength |
|---|---|---|---|
| Decision Tree | 1 deep tree | — | Interpretable |
| Random Forest | Many shallow trees | Bagging + avg | Low variance |
| XGBoost / LightGBM | Boosted trees | Additive | State-of-the-art on tabular |
| Extra Trees | Randomized splits | Bagging + avg | Faster; similar to RF |