Part 9 · Support Vector Machines — Cheat Sheet

The margin maximiser. Linear vs kernel SVMs, the kernel trick, the C / γ trade-off, and when SVM still beats trees.

1

The margin idea

Classification = find a hyperplane that separates two classes.

Many hyperplanes can do it. SVM picks the one with the maximum margin — the largest possible buffer to the nearest training points.

Why? More margin → better generalisation. A boundary that just squeaks between classes is more likely to fail on slightly shifted test data.

The points that sit on the margin (and, once a soft margin is allowed, inside or beyond it) are the support vectors: they alone define the boundary. Remove a non-support-vector point and the boundary doesn't move.
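A minimal sketch of the support-vector claim, assuming scikit-learn and made-up 2-D toy blobs. A very large C is used here as a stand-in for a hard margin; the data, seed and tolerance are illustrative choices, not part of the original sheet.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated 2-D blobs (toy data, invented for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# A very large C approximates the hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("support vector indices:", clf.support_)
print("margin width 2/||w||:", 2 / np.linalg.norm(w))

# Drop one point that is NOT a support vector and refit.
non_sv = np.setdiff1d(np.arange(len(X)), clf.support_)
keep = np.ones(len(X), dtype=bool)
keep[non_sv[0]] = False
clf2 = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])

# Same hyperplane, up to solver tolerance.
same_w = np.allclose(w, clf2.coef_[0], atol=1e-2)
same_b = np.isclose(b, clf2.intercept_[0], atol=1e-2)
print("boundary unchanged:", same_w and same_b)
```

Only a handful of the 40 points end up in `clf.support_`; the rest could be deleted without changing the fit.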

2

Soft margin (the C knob)

Real data is rarely linearly separable. The soft-margin SVM allows some points to violate the margin, with a penalty controlled by C:

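| C       | Behaviour                                                         |
|---------|-------------------------------------------------------------------|
| Large C | Hard-margin-ish. Few violations allowed. High variance, overfits. |
| Small C | Many violations allowed. High bias, smoother boundary.            |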

Tune C by cross-validation. Typical range: 0.1 to 100, log-spaced.

C is the inverse of regularisation strength — bigger C = less regularisation.
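A rough sketch of the C knob in action, assuming scikit-learn and deliberately overlapping toy blobs (centres, spread and C values are illustrative). It prints the margin width 2/‖w‖ and the support-vector count for each C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so no hyperplane separates them perfectly.
X, y = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])   # width of the margin band
    n_sv = int(clf.n_support_.sum())            # support vectors over both classes
    results[C] = (margin, n_sv)
    print(f"C={C:<6} margin width={margin:.2f}  support vectors={n_sv}")
```

Small C tolerates violations and keeps the margin wide; large C shrinks the margin to chase individual points.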

3

The kernel trick

Linear SVMs find linear boundaries. But many problems need curves.

The trick: map data to a higher-dimensional space where it is linearly separable, then find the hyperplane there.

The deeper trick: you never actually compute the high-dimensional features. The whole SVM math depends only on dot products between points — so you replace x_i · x_j with K(x_i, x_j), a kernel function that returns the dot product in the high-dim space.

K(x, y) = φ(x) · φ(y) — compute the dot product without computing φ.
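A small numeric check of the identity above, using the homogeneous quadratic kernel K(x, y) = (x · y)² on 2-D inputs. The feature map `phi` and the sample vectors are illustrative choices made for this sketch.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input (no bias term)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, y):
    """Homogeneous quadratic kernel: K(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(y))   # dot product taken in the 3-D feature space
implicit = poly2_kernel(x, y)       # same value, computed in the 2-D input space
print(explicit, implicit, np.isclose(explicit, implicit))
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the kernel is the only practical way to get this dot product.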

4

Kernels you'll meet

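| Kernel         | Formula          | Use when                                         |
|----------------|------------------|--------------------------------------------------|
| Linear         | x · y            | High-dim, sparse data (text). Often fine.        |
| Polynomial     | (γ x·y + r)^d    | Polynomial decision boundaries of known degree.  |
| RBF (Gaussian) | exp(−γ ‖x−y‖²)   | The default non-linear kernel. Local, smooth.    |
| Sigmoid        | tanh(γ x·y + r)  | NN-like, rarely useful in practice.              |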

RBF is the default for non-linear problems. It scores similarity by distance: K equals 1 for identical points and decays towards 0 as they move apart.
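The four formulas can be written directly in NumPy and compared with scikit-learn's pairwise kernel helpers. This is a sketch on random data; the values of `gamma`, `r` and `d` are arbitrary illustrations.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma, r, d = 0.5, 1.0, 3   # illustrative values only

dots = X @ X.T                                                   # x_i . x_j
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - x_j||^2

K_lin = dots
K_poly = (gamma * dots + r) ** d
K_rbf = np.exp(-gamma * sq_dists)
K_sig = np.tanh(gamma * dots + r)

# Each hand-written Gram matrix should match the library version.
print(np.allclose(K_lin, linear_kernel(X)))
print(np.allclose(K_poly, polynomial_kernel(X, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_rbf, rbf_kernel(X, gamma=gamma)))
print(np.allclose(K_sig, sigmoid_kernel(X, gamma=gamma, coef0=r)))
```

In `SVC` the same parameters are called `gamma`, `coef0` (r) and `degree` (d).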

5

The γ knob (RBF)

For RBF: γ controls how far the influence of a single training point reaches.

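| γ       | Effect                                                              |
|---------|---------------------------------------------------------------------|
| Large γ | Narrow influence. Boundary wiggles around each point. Overfit risk. |
| Small γ | Wide influence. Boundary smoother. Underfit risk.                   |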
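A quick way to see both failure modes is to compare train and test accuracy at extreme γ values. This sketch assumes scikit-learn's `make_moons` toy data; the noise level, split and γ values are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-moons data: non-linear, with some class overlap.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

scores = {}
for gamma in [0.001, 1, 1000]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    tr, te = clf.score(X_train, y_train), clf.score(X_test, y_test)
    scores[gamma] = (tr, te)
    print(f"gamma={gamma:<6} train={tr:.2f}  test={te:.2f}")
```

A large gap between train and test accuracy at the largest γ is the overfit signature; low accuracy on both at the smallest γ is the underfit one.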

γ and C interact strongly. Always grid-search both together:

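from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Search C and gamma jointly on a log-spaced grid with 5-fold cross-validation.
params = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), params, cv=5)
# Next step (X_train, y_train are placeholders for your own scaled data):
# grid.fit(X_train, y_train); print(grid.best_params_)

6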

When SVM wins

  • Small to medium datasets (≤ 100k samples).
  • High-dimensional features with clear separability (text TF-IDF, gene expression).
  • Clear margin between classes.
  • Non-linear but smooth decision boundary with RBF.
  • Need a stable model that doesn't depend on random seeds (the optimisation is convex, with no random initialisation).

7

When SVM loses

  • Large datasets (> 100k–1M samples). Kernel SVM training scales roughly between O(n²) and O(n³) in the number of samples, which gets painful.
  • Probabilistic output needed. SVM doesn't naturally give probabilities; probability=True uses an extra calibration pass.
  • Mixed feature types with messy preprocessing. Trees handle this better.
  • Streaming / online learning. Kernel SVMs need the full batch; for a linear SVM, SGDClassifier with hinge loss supports partial_fit.
  • Highly imbalanced classes. Use class_weight="balanced" or switch model family.

8

Preprocessing for SVM

Scaling is non-negotiable — SVM uses distances. Without scaling, the largest-magnitude feature dominates.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
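To see how much scaling matters, the same SVC can be cross-validated with and without the scaler. This is a sketch that assumes scikit-learn's bundled wine dataset, picked only because its features span very different ranges. The last part tunes C and γ through the pipeline, so the scaler is refit inside every fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

raw = SVC(kernel="rbf", C=1.0, gamma="scale")
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

raw_acc = cross_val_score(raw, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled, X, y, cv=5).mean()
print(f"unscaled: {raw_acc:.2f}  scaled: {scaled_acc:.2f}")

# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(scaled, param_grid, cv=5).fit(X, y)
print(grid.best_params_, f"cv accuracy: {grid.best_score_:.2f}")
```

Tuning through the pipeline, rather than scaling once up front, keeps validation-fold statistics out of the scaler.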

Other tips:

  • For text: LinearSVC is faster than SVC(kernel='linear').
  • For multiclass: SVC uses one-vs-one by default; LinearSVC uses one-vs-rest.
  • For probability: set probability=True (slower, calibrates via Platt scaling).
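The multiclass and probability tips can be checked on a small synthetic problem. This sketch assumes scikit-learn's `make_classification` with four classes; the sizes and seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X = StandardScaler().fit_transform(X)

# SVC trains one classifier per pair of classes: 4 * 3 / 2 = 6 columns.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print(ovo.decision_function(X).shape)    # (300, 6)

# LinearSVC trains one classifier per class (one-vs-rest): 4 columns.
ovr = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
print(ovr.decision_function(X).shape)    # (300, 4)

# probability=True adds an internal cross-validated calibration step.
prob = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
proba = prob.predict_proba(X[:5])
print(proba.sum(axis=1))                 # each row sums to 1, up to rounding
```

`SVC` returns the one-vs-rest shape from `decision_function` by default even though it trains one-vs-one internally, which is why `decision_function_shape="ovo"` is set explicitly here.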