Selection vs Reduction
Both shrink the feature space, but in different ways:
| | Feature Selection | Dimensionality Reduction |
|---|---|---|
| Method | Keep a subset of original features | Combine all features into new ones |
| Interpretability | High: you know what each column means | Low: components are linear combinations |
| Examples | Lasso, RFE, mutual info | PCA, LDA, autoencoders, t-SNE, UMAP |
| Output | Subset of original columns | New, transformed features |
Use selection when you need to explain to a stakeholder which features matter. Use reduction when you just want a smaller, denser representation.
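A minimal sketch of the difference in output, assuming scikit-learn and its bundled wine dataset (the dataset, `k=3` and the `f_classif` score are illustrative choices, not something this card prescribes). Selection returns three of the original columns untouched; reduction returns three new columns that each mix all thirteen.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = StandardScaler().fit_transform(data.data)
y = data.target
names = np.array(data.feature_names)

# Selection: keep 3 of the 13 original columns; their names survive.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected_names = names[selector.get_support()]
X_selected = selector.transform(X)

# Reduction: build 3 new columns, each a weighted sum of all 13 originals.
pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)

print("selected columns:", list(selected_names))
print("PC1 weights:", dict(zip(names, pca.components_[0].round(2))))
```

Both outputs have the same shape, but only the selected columns can be named to a stakeholder without further explanation.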
Why high dimensions hurt
The curse of dimensionality, in one card:
- Space grows exponentially. A unit cube keeps volume 1ᵈ = 1, but it has 2ᵈ corners and the ball inscribed in it shrinks to a vanishing share of the volume, so almost all the mass sits out near the corners.
- Distances concentrate. Every point becomes "far" from every other; nearest-neighbour stops meaning anything.
- Sample density falls. To keep the same data density, n must grow exponentially with d.
- Overfitting risk explodes. More features than samples = perfect train fit on garbage.
Symptoms: KNN gets worse, distance-based clustering breaks, models overfit despite regularisation.
What PCA actually does
PCA finds new axes — principal components — that:
- Are orthogonal to each other.
- Point in the directions of maximum variance in the data.
- Are ranked — PC1 explains the most variance, PC2 the second most, etc.
Each principal component is a recipe: a weighted sum of the original features. PC1 might be 0.6 × alcohol − 0.3 × acidity + 0.4 × phenols + ....
You then keep the top k components that explain, say, 90 % of the variance — and throw away the rest.
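The steps above can be sketched with plain NumPy. This is one way to compute PCA (eigendecomposition of the covariance matrix), shown on made-up toy data; the 90 % threshold follows the text, everything else (data shape, variable names) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 5 correlated features built from 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# 1. Centre and scale each column.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigendecompose the covariance matrix; the eigenvectors are the new axes.
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# 3. Rank components: largest variance first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the smallest k whose cumulative explained variance reaches 90 %.
explained = eigvals / eigvals.sum()
k = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)

# 5. Project: each score column is a weighted sum of the original features.
scores = Xs @ eigvecs[:, :k]

print("explained variance ratio:", explained.round(3))
print("components kept:", k)
print("PC1 recipe (weights per feature):", eigvecs[:, 0].round(2))
```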
When PCA is pointless
Don't reach for PCA reflexively. It hurts more than it helps when:
- Data is already low-dim. With 5 columns, just use them.
- All variances are similar. No "big" component to extract.
- You need interpretability. Components are messy combinations.
- Non-linear structure dominates. PCA is linear — t-SNE / UMAP / kernel PCA will reveal more.
- Class boundary lives in low-variance directions. PCA optimises for variance, not separability — see card 6.
PCA vs LDA
Both produce new axes. They optimise for different things:
| | PCA | LDA |
|---|---|---|
| Goal | Maximise total variance | Maximise class separability |
| Supervised? | No — ignores labels | Yes — uses class labels |
| Best for | Compression, visualisation | Classification preprocessing |
| Max components | min(n − 1, d) | min(n_classes − 1, d) |
LDA can outperform PCA when the downstream task is classification, because PCA's "big variance" direction may not be the "separates classes" direction.
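A small synthetic illustration of that last point, assuming scikit-learn. The data is constructed (not real) so that two features share a large common factor while the class signal lives in their difference; `separation` is a hypothetical helper that measures the gap between class means in pooled standard deviations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
y = np.repeat([0, 1], n)
sign = np.where(y == 1, 1.0, -1.0)

# Shared high-variance factor (no class info) plus a small class shift in opposite directions.
shared = rng.normal(0.0, 3.0, size=2 * n)
x0 = shared + 0.6 * sign + rng.normal(0.0, 0.3, size=2 * n)
x1 = shared - 0.6 * sign + rng.normal(0.0, 0.3, size=2 * n)
X = StandardScaler().fit_transform(np.column_stack([x0, x1]))

def separation(z, labels):
    """Gap between the two class means, in units of pooled standard deviation."""
    a, b = z[labels == 0], z[labels == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled

z_pca = PCA(n_components=1).fit_transform(X).ravel()
z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y).ravel()

print(f"class separation on PC1: {separation(z_pca, y):.2f}")
print(f"class separation on LD1: {separation(z_lda, y):.2f}")
```

PC1 follows the shared factor and mixes the classes; the single LDA axis follows the difference between the two features and pulls them apart.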
Two things that quietly break PCA
1. Unscaled features. PCA finds max-variance axes, so a column measured in millions dominates a column in 0–10. Standardise first (for example with StandardScaler) whenever features are on different scales.
2. Class-relevant info in low-variance directions. PCA throws away low-variance components. If a tiny but informative feature carries the class signal, PCA can throw the signal away.
Cross-validate with and without PCA to be sure it's actually helping the downstream model.
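One way to run that check, sketched with scikit-learn pipelines on the bundled wine dataset. The classifier, `n_components=5` and 5-fold CV are arbitrary illustrative choices; the unscaled pipeline may emit convergence warnings, which is part of the point.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# How much of the total variance does PC1 claim, with and without scaling?
raw_ratio = PCA().fit(X).explained_variance_ratio_[0]
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_[0]
print(f"PC1 share of variance: raw={raw_ratio:.3f}, scaled={scaled_ratio:.3f}")

# Same downstream model, three preprocessing choices.
candidates = {
    "scaled, no PCA": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "scaled + PCA": make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression(max_iter=5000)),
    "unscaled + PCA": make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=5000)),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
for name, score in scores.items():
    print(f"{name:>15}: mean CV accuracy = {score:.3f}")
```

Putting the scaler and PCA inside the pipeline means they are refit on each training fold, so the comparison does not leak information from the validation folds.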
How many components?
Three strategies:
- Cumulative explained variance. Plot it; pick the elbow. Common target: keep components that explain 90 % or 95 % of total variance.
- Kaiser criterion. Keep components with eigenvalue > 1 (only for correlation-matrix PCA).
- Downstream CV score. Treat n_components as a hyperparameter; cross-validate.
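The three strategies can be compared side by side. This is a sketch assuming scikit-learn and the bundled wine dataset; the 95 % target, the candidate grid and the logistic-regression downstream model are illustrative assumptions, and the three answers need not agree.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# 1. Cumulative explained variance: smallest k that reaches the target.
cumulative = np.cumsum(pca.explained_variance_ratio_)
k_variance = int(np.argmax(cumulative >= 0.95) + 1)

# 2. Kaiser criterion: eigenvalues > 1. On standardised data these are
#    (up to an n / (n - 1) factor) the eigenvalues of the correlation matrix.
k_kaiser = int(np.sum(pca.explained_variance_ > 1.0))

# 3. Downstream CV score: tune n_components like any other hyperparameter.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe, {"pca__n_components": [2, 4, 6, 8, 10, 13]}, cv=5).fit(X, y)
k_cv = search.best_params_["pca__n_components"]

print("95% variance:", k_variance, "| Kaiser:", k_kaiser, "| best by CV:", k_cv)
```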
In practice: n_components=0.95 in scikit-learn lets PCA pick the smallest number that hits the 95 % threshold.
PCA(n_components=0.95)  # keeps enough PCs to reach 95% variance
Kernel PCA
Plain PCA is linear. Real data often has non-linear structure (curved manifolds, twisted clusters).
Kernel PCA trick: apply PCA in a high-dimensional feature space implicitly via a kernel function (RBF, polynomial), without actually computing the high-dim features.
| Kernel | Use case |
|---|---|
| linear | Same as plain PCA. |
| rbf | Smooth, curved manifolds. Most common. |
| poly | Polynomial relationships of fixed degree. |
| sigmoid | Neural-net-like transformations. |
Trade-off: slower than vanilla PCA, harder to interpret, but captures structure plain PCA can't.
For visualisation specifically: t-SNE and UMAP usually beat kernel PCA.
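A minimal sketch of the linear vs kernel difference, assuming scikit-learn's `KernelPCA` and its `make_circles` toy generator. The concentric-circles data and `gamma=10` are illustrative choices (in practice `gamma` needs tuning), and `separation` is a hypothetical helper measuring the gap between class means in pooled standard deviations.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: no straight axis separates the inner ring from the outer one.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

def separation(z, labels):
    """Gap between the two class means, in units of pooled standard deviation."""
    a, b = z[labels == 0], z[labels == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled

# First component only, linear vs RBF kernel.
z_linear = PCA(n_components=1).fit_transform(X).ravel()
z_rbf = KernelPCA(n_components=1, kernel="rbf", gamma=10).fit_transform(X).ravel()

print(f"ring separation on linear PC1: {separation(z_linear, y):.2f}")
print(f"ring separation on RBF PC1:    {separation(z_rbf, y):.2f}")
```

The linear projection folds both rings onto the same interval, while the RBF component responds to the radial structure that a straight axis cannot capture.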