Maria Aguilera

Feature selection and dimensionality reduction both shrink the feature space, but in different ways:

	Feature Selection	Dimensionality Reduction
Method	Keep a subset of original features	Combine all features into new ones
Interpretability	High: you know what each column means	Low: each new feature mixes many original ones
Examples	Lasso, RFE, mutual info	PCA, LDA, autoencoders, t-SNE, UMAP
Output	Subset of original columns	New, transformed features

Use selection when you need to explain to a stakeholder which features matter. Use reduction when you just want a smaller, denser representation.
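To make the contrast concrete, here is a minimal sketch on scikit-learn's bundled wine dataset. The dataset, the choice of k = 3, and mutual information as the scoring function are illustrative assumptions, not recommendations.

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X, y = data.data, data.target
feature_names = np.array(data.feature_names)

# Selection: keep 3 of the 13 original columns; their names survive.
score_func = partial(mutual_info_classif, random_state=0)  # fixed seed for repeatability
selector = SelectKBest(score_func, k=3).fit(X, y)
selected_names = list(feature_names[selector.get_support()])

# Reduction: build 3 new columns, each a weighted mix of all 13 originals.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(selected_names)         # three original column names
print(pca.components_.shape)  # (3, 13): one weight per original feature, per component
```

The selector's output can be shown to a stakeholder as is; the PCA output needs the weight matrix in `pca.components_` to be explained.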

The curse of dimensionality, in one card:

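Space grows exponentially. A unit cube keeps volume 1ᵈ = 1 in every dimension, but the ball inscribed in it holds a vanishing share of that volume, so almost all the mass ends up out in the corners.
Distances concentrate. Pairwise distances bunch around the same value, so the nearest neighbour is barely closer than the farthest one and nearest-neighbour search stops meaning anything.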
Sample density falls. To keep the same data density, n must grow exponentially with d.
Overfitting risk explodes. With more features than samples, a model can fit the training set perfectly even when the features are pure noise.

Symptoms: KNN gets worse, distance-based clustering breaks, models overfit despite regularisation.
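Two of these effects are easy to check numerically. The sketch below assumes points drawn uniformly from the unit cube; the helper names `inscribed_ball_fraction` and `nearest_to_farthest_ratio` are made up for this illustration.

```python
import math

import numpy as np

rng = np.random.default_rng(0)


def inscribed_ball_fraction(d):
    """Share of the unit cube's volume inside the radius-0.5 ball it contains."""
    # Computed in log space so large d underflows to 0.0 instead of overflowing.
    log_volume = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1) + d * math.log(0.5)
    return math.exp(log_volume)


def nearest_to_farthest_ratio(d, n=500):
    """Nearest / farthest distance from a random query to n uniform points in [0, 1]^d."""
    points = rng.random((n, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()


for d in (2, 10, 100, 1000):
    print(d, inscribed_ball_fraction(d), nearest_to_farthest_ratio(d))
```

As d grows, the ball fraction should collapse toward 0 and the nearest-to-farthest ratio should climb toward 1, which is why "nearest" stops being informative.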

PCA finds new axes — principal components — that:

Are orthogonal to each other.
Point in the directions of maximum variance in the data.
Are ranked — PC1 explains the most variance, PC2 the second most, etc.

Each principal component is a recipe: a weighted sum of the original features. PC1 might be 0.6 × alcohol − 0.3 × acidity + 0.4 × phenols + ....

You then keep the top k components that explain, say, 90 % of the variance — and throw away the rest.
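The three properties can be reproduced by hand. This is a sketch of one standard route (eigen-decomposition of the covariance matrix of standardised data), cross-checked against scikit-learn; the wine dataset is an assumed stand-in for your own data.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Eigenvectors of the covariance matrix are the principal axes.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # re-rank so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()  # share of total variance per component
pc1_weights = eigvecs[:, 0]                # the "recipe" for PC1: one weight per feature

# Cross-check against scikit-learn (the sign of each component is arbitrary).
pca = PCA().fit(X)
print(np.round(explained_ratio[:3], 3))
print(np.round(pca.explained_variance_ratio_[:3], 3))
```

The columns of `eigvecs` are mutually orthogonal, `explained_ratio` is sorted from largest to smallest, and the two printed rows should agree.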

Don't reach for PCA reflexively. It hurts more than it helps when:

Data is already low-dim. With 5 columns, just use them.
All variances are similar. No "big" component to extract.
You need interpretability. Components are messy combinations.
Non-linear structure dominates. PCA is linear — t-SNE / UMAP / kernel PCA will reveal more.
Class boundary lives in low-variance directions. PCA optimises for variance, not separability — see card 6.

PCA and LDA both produce new axes, but they optimise for different things:

	PCA	LDA
Goal	Maximise total variance	Maximise class separability
Supervised?	No — ignores labels	Yes — uses class labels
Best for	Compression, visualisation	Classification preprocessing
Max components	`min(n − 1, d)`	`min(n_classes − 1, d)`

LDA can outperform PCA when the downstream task is classification, because PCA's "big variance" direction may not be the "separates classes" direction.
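A small synthetic case shows the gap. The data below is constructed for this sketch: two strongly correlated noise columns create the highest-variance direction, while a third column carries the class. All sizes and noise levels are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
latent = rng.normal(size=n)  # shared noise, unrelated to the class

X = np.column_stack([
    latent + 0.1 * rng.normal(size=n),   # noise column A
    latent + 0.1 * rng.normal(size=n),   # noise column B, correlated with A
    0.5 * rng.normal(size=n) + 2.0 * y,  # the only column that carries the class
])

pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=1), LogisticRegression())
lda_pipe = make_pipeline(
    StandardScaler(), LinearDiscriminantAnalysis(n_components=1), LogisticRegression()
)

pca_acc = cross_val_score(pca_pipe, X, y, cv=5).mean()
lda_acc = cross_val_score(lda_pipe, X, y, cv=5).mean()
print(f"PCA -> LR: {pca_acc:.2f}   LDA -> LR: {lda_acc:.2f}")
```

With a single component, PCA keeps the correlated noise direction and should score near chance, while LDA keeps the direction that separates the classes.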

Two things quietly break PCA:

1. Unscaled features. PCA finds max-variance axes, so a column measured in millions dominates a column ranging over 0–10. Always apply StandardScaler first.

2. Class-relevant info in low-variance directions. PCA discards low-variance components, so if the class signal lives in a low-variance direction, PCA can discard the signal along with it.

Cross-validate with and without PCA to be sure it's actually helping the downstream model.
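The scaling pitfall is easy to see on two independent columns with very different units. The column names and distributions here are hypothetical, chosen only to exaggerate the scale gap.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
income = rng.normal(50_000, 15_000, size=n)  # hypothetical column in large units
rating = rng.normal(5, 2, size=n)            # hypothetical column on a roughly 0-10 scale
X = np.column_stack([income, rating])

raw = PCA(n_components=2).fit(X)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(raw.explained_variance_ratio_)     # PC1 takes essentially all of the variance
print(np.abs(raw.components_[0]))        # PC1 is almost purely the income column
print(scaled.explained_variance_ratio_)  # close to an even split once both are scaled
```

Without scaling, PCA reports one "important" component that is just the column with the biggest units.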

Three strategies for choosing the number of components:

Cumulative explained variance. Plot it against the number of components and pick the elbow, or set a target: commonly 90 % or 95 % of total variance.
Kaiser criterion. Keep components with eigenvalue > 1 (valid only when PCA is run on the correlation matrix, i.e. on standardised features).
Downstream CV score. Treat n_components as a hyperparameter; cross-validate.

In practice: n_components=0.95 in scikit-learn lets PCA pick the smallest number that hits the 95 % threshold.

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # keeps enough PCs to reach 95% variance
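For reference, the first two strategies can be computed by hand from a full PCA fit. This sketch assumes standardised wine data and a 95 % target; `k_manual`, `k_auto` and `k_kaiser` are names invented for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95 %.
k_manual = int(np.searchsorted(cumulative, 0.95) + 1)

# The float shortcut should land on the same k.
k_auto = int(PCA(n_components=0.95).fit(X).n_components_)

# Kaiser criterion: count eigenvalues above 1 (data is standardised here).
k_kaiser = int((full.explained_variance_ > 1).sum())

print(k_manual, k_auto, k_kaiser)
```

The Kaiser count is usually the stricter of the two, so it is worth checking both against the downstream CV score before committing.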

Plain PCA is linear. Real data often has non-linear structure (curved manifolds, twisted clusters).

Kernel PCA trick: apply PCA in a high-dimensional feature space implicitly via a kernel function (RBF, polynomial), without actually computing the high-dim features.

Kernel	Use case
`linear`	Same as plain PCA.
`rbf`	Smooth, curved manifolds. Most common.
`poly`	Polynomial relationships of fixed degree.
`sigmoid`	Neural-net-like transformations.

Trade-off: slower than vanilla PCA, harder to interpret, but captures structure plain PCA can't.

For visualisation specifically: t-SNE and UMAP usually beat kernel PCA.
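Concentric circles are the usual toy case where plain PCA fails and an RBF kernel helps. In this sketch the dataset parameters and `gamma=10` are assumptions; `gamma` in particular needs tuning on real data.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_pipe = make_pipeline(PCA(n_components=2), LogisticRegression())
kernel_pipe = make_pipeline(
    KernelPCA(n_components=2, kernel="rbf", gamma=10),  # gamma chosen by hand, not tuned
    LogisticRegression(),
)

linear_acc = cross_val_score(linear_pipe, X, y, cv=5).mean()
kernel_acc = cross_val_score(kernel_pipe, X, y, cv=5).mean()
print(f"PCA: {linear_acc:.2f}   kernel PCA: {kernel_acc:.2f}")
```

Plain PCA only rotates the rings, so a linear classifier stays near chance; the RBF projection should pull the inner ring away from the outer one.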

The workflow

Scale the data with StandardScaler (non-negotiable).
Fit PCA on the training set only.
Decide n_components by cumulative variance or CV.
Transform train and test using the same fitted PCA.
Cross-validate the downstream model with PCA inside the pipeline — never separately, or you leak.
Compare to a baseline without PCA. Sometimes the original features win.

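from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(),
)
cross_val_score(pipe, X, y, cv=5)  # X, y: your feature matrix and labels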
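The last two workflow steps, tuning n_components by CV and comparing against a no-PCA baseline, might look like the sketch below. The wine dataset, the candidate grid, and `max_iter=1000` are assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Scaler and PCA are refit on each training fold, so the held-out fold never leaks in.
search = GridSearchCV(pipe, {"pca__n_components": [2, 4, 6, 8, 0.95]}, cv=5)
search.fit(X, y)

# Baseline: same model and scaling, no PCA.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline_score = cross_val_score(baseline, X, y, cv=5).mean()

print(search.best_params_, round(search.best_score_, 3), round(baseline_score, 3))
```

If the baseline matches or beats the best PCA score, the extra step is not earning its place in the pipeline.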

Part 10 · PCA & Dimensionality Reduction — Cheat Sheet

Selection vs Reduction

Why high dimensions hurt

What PCA actually does

When PCA is pointless

PCA vs LDA

Two things that quietly break PCA

How many components?

Kernel PCA