Part 10 · PCA & Dimensionality Reduction — Cheat Sheet

Principal Component Analysis end-to-end. Why it works, when it breaks, how it differs from LDA, and the practical workflow.

1

Selection vs Reduction

Both shrink the feature space but in different ways:

                 | Feature Selection                     | Dimensionality Reduction
Method           | Keep a subset of original features    | Combine all features into new ones
Interpretability | High: you know what each column means | Low: components are linear combinations
Examples         | Lasso, RFE, mutual info               | PCA, LDA, autoencoders, t-SNE, UMAP
Output           | Subset of original columns            | New, transformed features

Use selection when you need to explain to a stakeholder which features matter. Use reduction when you just want a smaller, denser representation.
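
A minimal sketch of the difference in output, assuming scikit-learn's bundled wine dataset as a stand-in for your data. The choice of k=3 and of mutual information as the selection score are illustrative assumptions, not recommendations.

```python
from functools import partial

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X, y, names = wine.data, wine.target, wine.feature_names

# Selection: keep 3 of the 13 original columns; their names survive.
score = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score, k=3).fit(X, y)
kept_columns = [names[i] for i in selector.get_support(indices=True)]
X_selected = selector.transform(X)

# Reduction: build 3 new columns, each a weighted mix of all 13 originals.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(kept_columns)            # three original feature names
print(X_selected.shape)        # same row count, 3 original columns
print(pca.components_.shape)   # (3, 13): each new column has a weight for every feature
```

Both outputs have three columns, but only the selected ones can be named to a stakeholder.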

2

Why high dimensions hurt

The curse of dimensionality, in one card:

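  • Space grows exponentially. A unit cube always has volume 1ᵈ = 1, but it has 2ᵈ corners and its diagonal grows as √d, so as d rises almost all of the volume sits far from the centre, out towards the corners.
  • Distances concentrate. The nearest and farthest neighbours of a point end up at almost the same distance, so "nearest neighbour" carries little information.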
  • Sample density falls. To keep the same data density, n must grow exponentially with d.
  • Overfitting risk explodes. With more features than samples (d ≥ n), a linear model can fit the training labels perfectly even when they are pure noise.

Symptoms: KNN gets worse, distance-based clustering breaks, models overfit despite regularisation.
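
Distance concentration is easy to see numerically. The sketch below assumes points drawn uniformly from the unit cube and compares the nearest and farthest distance from one query point; the helper name and the sample size of 500 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(d, n=500):
    """Nearest / farthest distance from one query to n uniform points in [0, 1]^d."""
    points = rng.random((n, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"d={d:>4}  nearest/farthest = {r:.3f}")
```

A ratio near 0 means the nearest neighbour is much closer than the rest; a ratio near 1 means every point is roughly equally far away, which is what happens as d grows.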

3

What PCA actually does

PCA finds new axes — principal components — that:

  1. Are orthogonal to each other.
  2. Point in the directions of maximum variance in the data.
  3. Are ranked — PC1 explains the most variance, PC2 the second most, etc.

Each principal component is a recipe: a weighted sum of the original features, with one weight (loading) per feature. PC1 might be 0.6 × alcohol − 0.3 × acidity + 0.4 × phenols + ...

You then keep the top k components that explain, say, 90 % of the variance — and throw away the rest.
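
The three properties can be checked directly on a fitted model. This is a sketch on scikit-learn's wine dataset (an assumed example dataset), standardised first; it also prints the three largest weights in the PC1 recipe.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)   # 178 samples, 13 features

pca = PCA().fit(X)
W = pca.components_            # shape (13, 13); row i holds the weights of PC(i+1)

# 1. Orthogonal (and unit length): W @ W.T is the identity.
is_orthonormal = np.allclose(W @ W.T, np.eye(W.shape[0]))

# 2. Max variance: the variance of the data projected onto PC1 equals the
#    largest eigenvalue of the covariance matrix.
top_eigenvalue = np.linalg.eigvalsh(np.cov(X, rowvar=False)).max()
pc1_variance = (X @ W[0]).var(ddof=1)

# 3. Ranked: explained variance never increases from one PC to the next.
ratios = pca.explained_variance_ratio_
is_ranked = bool(np.all(np.diff(ratios) <= 0))

print(is_orthonormal, is_ranked, np.isclose(pc1_variance, top_eigenvalue))

# The PC1 "recipe": one weight per original feature (the overall sign is arbitrary).
recipe = sorted(zip(wine.feature_names, W[0]), key=lambda t: -abs(t[1]))
for name, weight in recipe[:3]:
    print(f"{weight:+.2f} x {name}")
```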

4

When PCA is pointless

Don't reach for PCA reflexively. It hurts more than it helps when:

  • Data is already low-dim. With 5 columns, just use them.
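  • All variances are similar. When the features are nearly uncorrelated and on the same scale, every component explains about the same share, so there is no "big" component to extract.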
  • You need interpretability. Components are messy combinations.
  • Non-linear structure dominates. PCA is linear; t-SNE, UMAP, or kernel PCA often reveal more.
  • Class boundary lives in low-variance directions. PCA optimises for variance, not separability — see card 6.
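
The "all variances are similar" case can be reproduced with synthetic data. A small sketch, assuming 10 independent standard-normal features: no component stands out, so dropping any of them loses about as much as dropping a raw column.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_iso = rng.normal(size=(2000, 10))   # 10 uncorrelated features, equal variance

ratios = PCA().fit(X_iso).explained_variance_ratio_
print(ratios.round(3))                # every entry is close to 1/10
```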

5

PCA vs LDA

Both produce new axes. They optimise for different things:

               | PCA                        | LDA
Goal           | Maximise total variance    | Maximise class separability
Supervised?    | No: ignores labels         | Yes: uses class labels
Best for       | Compression, visualisation | Classification preprocessing
Max components | min(n − 1, d)              | min(n_classes − 1, d)

LDA can outperform PCA when the downstream task is classification, because PCA's "big variance" direction may not be the "separates classes" direction.
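
A hedged comparison sketch on the wine dataset (3 classes, so LDA allows at most 2 components). The KNN classifier and 5-fold CV are assumptions chosen for illustration; which method scores higher depends on the data.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Same scaler, same classifier; only the 2-D projection differs.
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier())
lda_knn = make_pipeline(
    StandardScaler(), LinearDiscriminantAnalysis(n_components=2), KNeighborsClassifier()
)

pca_scores = cross_val_score(pca_knn, X, y, cv=5)
lda_scores = cross_val_score(lda_knn, X, y, cv=5)
print(f"PCA -> KNN accuracy: {pca_scores.mean():.3f}")
print(f"LDA -> KNN accuracy: {lda_scores.mean():.3f}")

# The component cap is enforced: 3 classes cannot yield 3 discriminants.
lda_rejected_three = False
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
except ValueError as err:
    lda_rejected_three = True
    print("LDA rejected n_components=3:", err)
```

The pipeline passes y to LDA during fitting and ignores it for PCA, which is the supervised versus unsupervised distinction in the table.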

6

Two things that quietly break PCA

1. Unscaled features. PCA finds max-variance axes, so a column measured in millions dominates a column in the 0–10 range. Standardise first (e.g. StandardScaler) whenever features are on different scales.

2. Class-relevant info in low-variance directions. PCA throws away low-variance components. If a low-variance but informative direction carries the class signal, PCA can throw the signal away.

Cross-validate with and without PCA to be sure it's actually helping the downstream model.
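
Both failure modes can be checked in a few lines. The sketch below uses the wine dataset, whose columns are on very different scales; the logistic-regression classifier and n_components=5 are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# 1. Scaling: share of total variance captured by PC1, raw vs standardised.
raw_pc1_share = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
X_scaled = StandardScaler().fit_transform(X)
scaled_pc1_share = PCA(n_components=1).fit(X_scaled).explained_variance_ratio_[0]
print(f"PC1 share, raw:    {raw_pc1_share:.3f}")     # dominated by the largest-scale column
print(f"PC1 share, scaled: {scaled_pc1_share:.3f}")

# 2. Does PCA help the downstream model? Cross-validate both pipelines.
with_pca = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression(max_iter=1000))
without_pca = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

score_with = cross_val_score(with_pca, X, y, cv=5).mean()
score_without = cross_val_score(without_pca, X, y, cv=5).mean()
print(f"accuracy with PCA:    {score_with:.3f}")
print(f"accuracy without PCA: {score_without:.3f}")
```

Putting the scaler and PCA inside the pipeline means they are refit on each training fold, so the comparison is free of leakage.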

7

How many components?

Three strategies:

  1. Cumulative explained variance. Plot it; pick the elbow. Common target: keep components that explain 90 % or 95 % of total variance.
  2. Kaiser criterion. Keep components with eigenvalue > 1 (only valid for correlation-matrix PCA, i.e. standardised features, where the average eigenvalue is 1).
  3. Downstream CV score. Treat n_components as a hyperparameter; cross-validate.

In practice: n_components=0.95 in scikit-learn lets PCA pick the smallest number that hits the 95 % threshold.

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # keeps the fewest PCs whose cumulative explained variance exceeds 95 %
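
Strategy 1 can also be done by hand, which makes it clear what the float shortcut is doing. A sketch on scikit-learn's digits dataset (an assumed example; its 64 pixel features share one scale, so no standardisation is applied here):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                      # 1797 samples, 64 pixel features

# Manual route: fit all components, then find where the running total passes 95 %.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k_manual = int(np.argmax(cumulative > 0.95)) + 1

# Shortcut: let PCA choose the count itself.
pca_95 = PCA(n_components=0.95).fit(X)

print(k_manual, pca_95.n_components_)       # the two counts agree
```

Plotting `cumulative` against the component index gives the curve described in strategy 1. For strategy 3, n_components can be tuned like any other hyperparameter inside a pipeline.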

8

Kernel PCA

Plain PCA is linear. Real data often has non-linear structure (curved manifolds, twisted clusters).

Kernel PCA trick: apply PCA in a high-dimensional feature space implicitly via a kernel function (RBF, polynomial), without actually computing the high-dim features.

Kernel  | Use case
linear  | Same as plain PCA.
rbf     | Smooth, curved manifolds. Most common.
poly    | Polynomial relationships of fixed degree.
sigmoid | Neural-net-like transformations.

Trade-off: slower than vanilla PCA, harder to interpret, but captures structure plain PCA can't.

For visualisation specifically: t-SNE and UMAP usually beat kernel PCA.