Maria Aguilera

Three properties — a good feature is all three at once:

Informative — actually correlates with the target. The feature carries signal.
Discriminating — varies enough to separate classes / predict the target.
Independent — adds something the other features don't already give you.

Bad features:

Constant column → no information.
Perfectly correlated with another → redundant.
High-cardinality unique IDs → memorisation risk.
Leaky features that contain the target → too good to be true.
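
A minimal sketch of how the four warning signs above could be checked with pandas. The function name `flag_bad_features`, the 0.999 threshold, and the toy columns are all assumptions made for illustration; a correlation check only catches the most obvious leaks and duplicates.

```python
import numpy as np
import pandas as pd


def flag_bad_features(df, target, corr_threshold=0.999):
    """Return suspicious columns grouped by the reason they look bad."""
    features = df.drop(columns=[target])
    numeric = features.select_dtypes(include="number")
    non_numeric = features.select_dtypes(exclude="number")

    # Constant column: one distinct value, so no information.
    constant = [c for c in features.columns
                if features[c].nunique(dropna=False) <= 1]

    # ID-like column: a non-numeric column with a distinct value on every row.
    # Integer IDs are not caught by this heuristic and need a manual check.
    id_like = [c for c in non_numeric.columns
               if non_numeric[c].nunique() == len(df)]

    # Redundant column: (almost) perfectly correlated with an earlier column.
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns
                 if (upper[c] >= corr_threshold).any()]

    # Possible leak: tracks the target (almost) perfectly.
    target_corr = numeric.corrwith(df[target]).abs()
    leaky = [c for c in numeric.columns if target_corr[c] >= corr_threshold]

    return {"constant": constant, "id_like": id_like,
            "redundant": redundant, "leaky": leaky}


# Toy data (hypothetical): every kind of bad feature appears once.
churned = [0, 1, 0, 1, 1, 0]
height_cm = [160, 172, 181, 158, 190, 175]
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],   # unique ID
    "country": ["ES"] * 6,                             # constant
    "height_cm": height_cm,
    "height_in": [h / 2.54 for h in height_cm],        # duplicate of height_cm
    "age": [23, 45, 31, 52, 38, 27],
    "refund_issued": churned,                          # copy of the target
    "churned": churned,
})

report = flag_bad_features(df, target="churned")
print(report)
```

Anything the report flags is a prompt to look at the column, not an automatic drop.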

Feature engineering is not a one-shot transformation. It's a loop:

Brainstorm features from domain knowledge.
Decide which to create.
Implement transformations.
Train a model with them.
Evaluate — did they help? did they leak?
Repeat with new ideas / removals.

Each iteration narrows down to a feature set that genuinely helps the model. Domain knowledge is the engine of step 1 — and it's irreplaceable.
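
One pass through the loop can be sketched as a with/without comparison under cross-validation. The data here is synthetic and built so that the ratio feature matters; the variable names and the choice of `LinearRegression` are assumptions for the example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
height = rng.uniform(1.5, 2.0, n)      # metres
weight = rng.uniform(50, 110, n)       # kilograms
# Synthetic target driven by the ratio weight / height^2, plus noise.
y = weight / height**2 + rng.normal(0, 0.5, n)


def cv_r2(X):
    """Mean 5-fold cross-validated R^2 of a linear model on X."""
    return cross_val_score(LinearRegression(), X, y, cv=5).mean()


# Steps 3-4: implement the candidate feature and train with and without it.
baseline = cv_r2(np.column_stack([height, weight]))
with_ratio = cv_r2(np.column_stack([height, weight, weight / height**2]))

# Step 5: evaluate. Keep the feature only if the CV score improves.
print(f"baseline R^2:   {baseline:.3f}")
print(f"with ratio R^2: {with_ratio:.3f}")
```

On real data the gain is rarely this clean, which is why the comparison is repeated for every new idea.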

Common recipes:

Pattern	Example
Datetime extraction	`date` → `weekday`, `month`, `hour`, `is_weekend`
Numeric ratios	`height` + `weight` → `BMI`
Binning	`age` → `young / mid / senior`
Log transform	`price` → `log(price)` to fix skew
Polynomial	`x` → `x²`, `x³`; interaction terms: `x₁`, `x₂` → `x₁·x₂`
Aggregations	`customer_id` → `avg_order`, `n_purchases`
Encoding categoricals	One-hot, ordinal, target (see Part 2)
Text	TF-IDF, embeddings, char n-grams
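
A short pandas sketch of several recipes from the table. The `orders` frame, its column names, and the age bin edges (30 and 60) are made up for illustration.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2024-01-06 09:30", "2024-01-08 14:00", "2024-01-13 20:15",
        "2024-02-01 08:00", "2024-02-03 11:45",
    ]),
    "price": [10.0, 250.0, 40.0, 1200.0, 15.0],
    "age": [23, 23, 67, 67, 67],
    "height_m": [1.80, 1.80, 1.65, 1.65, 1.65],
    "weight_kg": [81.0, 81.0, 60.0, 60.0, 60.0],
})

# Datetime extraction
orders["weekday"] = orders["date"].dt.dayofweek   # Monday = 0
orders["month"] = orders["date"].dt.month
orders["hour"] = orders["date"].dt.hour
orders["is_weekend"] = orders["weekday"] >= 5

# Numeric ratio
orders["bmi"] = orders["weight_kg"] / orders["height_m"] ** 2

# Binning (bin edges are an assumption)
orders["age_group"] = pd.cut(
    orders["age"], bins=[0, 30, 60, np.inf], labels=["young", "mid", "senior"]
)

# Log transform (prices must be strictly positive)
orders["log_price"] = np.log(orders["price"])

# Aggregations per customer
by_customer = orders.groupby("customer_id")["price"]
orders["avg_order"] = by_customer.transform("mean")
orders["n_purchases"] = by_customer.transform("count")

print(orders[["weekday", "is_weekend", "age_group", "avg_order", "n_purchases"]])
```

In a real project the per-customer aggregates would be computed on the training split only, so that validation rows do not influence them.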

Each new feature is a hypothesis about what helps prediction. Test it.

Family	How it works	Pros	Cons
Filter	Score each feature in isolation with a stat (correlation, chi², ANOVA, mutual info).	Fast. Model-agnostic. Cheap to run.	Ignores interactions. Univariate blindness.
Wrapper	Wrap a model; add/drop features greedily (RFE, forward/backward).	Captures interactions.	Slow — fits model many times. Overfit-prone.
Embedded	Selection happens during training (Lasso, tree importances).	Often a good balance of speed and accuracy.	Coupled to the chosen model.

Default: start with filter for sanity, then embedded (Lasso or tree importances) for the real selection.
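
For comparison, here is one scikit-learn selector from each family run on the same synthetic data. The dataset, `k=3`, and `alpha=1.0` are arbitrary choices for the sketch, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Filter: score each feature on its own with a univariate F-test.
filter_sel = SelectKBest(score_func=f_regression, k=3).fit(X, y)

# Wrapper: recursive feature elimination around a linear model.
wrapper_sel = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)

# Embedded: keep the features whose Lasso weight is not (near) zero.
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)

for name, selector in [("filter", filter_sel),
                       ("wrapper", wrapper_sel),
                       ("embedded", embedded_sel)]:
    print(name, selector.get_support(indices=True))
```

The three selectors need not agree; where they disagree is usually worth a closer look.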

Many embedded methods are regularised regressions. They add a penalty on large weights to the training loss to prevent overfitting:

Method	Penalty	Effect
Ridge (L2)	$\lambda \sum w_j^2$	Shrinks weights toward 0, never to 0. Keeps all features.
Lasso (L1)	$\lambda \sum |w_j|$	Shrinks weights and zeroes some out. Acts as a feature selector.
Elastic Net	$\lambda\left(\alpha\|w\|_1 + (1-\alpha)\|w\|_2^2\right)$	Combines both penalties, with $\alpha \in [0, 1]$ setting the L1/L2 mix. Use when features are correlated.

Why Lasso zeroes coefficients but Ridge doesn't: the L1 constraint region is a diamond with corners on the axes. The loss contours often first touch the diamond at a corner, where some weights are exactly 0. The L2 region is a circle with no corners, so the solution almost never lands exactly on an axis.
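
The difference is easy to see empirically. This sketch fits both models on synthetic data where only 4 of 20 features matter, then counts exact zeros; the penalty strengths are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)   # penalties assume comparable scales

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Count coefficients that are exactly zero.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge zero weights:", n_zero_ridge)
print("Lasso zero weights:", n_zero_lasso)
```

Note that scikit-learn calls the penalty strength `alpha`, which corresponds to λ in the table above.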

Confusing parameters with hyperparameters is a classic interview trap.

	Parameter	Hyperparameter
What	Internal value the model learns	Setting you choose before training
Examples	Weights `w`, biases `b`, tree splits	`λ` in Ridge, `k` in KNN, `max_depth` in trees, learning rate in NN
Tuned by	Optimisation on training data	Cross-validation on validation data
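
In scikit-learn the distinction shows up in where each value lives: hyperparameters are constructor arguments, and learned parameters appear as attributes with a trailing underscore after `fit`. A small sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hyperparameter: chosen by you before training.
model = Ridge(alpha=1.0)
fitted_before = hasattr(model, "coef_")   # no learned weights yet

# Parameters: learned from the training data during fit.
model.fit(X, y)
weights, bias = model.coef_, model.intercept_

print("hyperparameters:", model.get_params())
print("parameters:", weights, bias)
```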

How to find optimal λ?

Pick a grid of λ values (often log-spaced: 0.001, 0.01, 0.1, 1, 10).
For each λ, cross-validate the model.
Pick the λ with the best mean validation score.
Re-fit on the full train set with that λ.
Evaluate once on the held-out test set.

Tools: GridSearchCV, RandomizedSearchCV, Optuna.
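
The five steps map onto `GridSearchCV` roughly as follows. This is a sketch on synthetic data, with Ridge standing in for whatever model is being tuned; the dataset and split sizes are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 1: log-spaced grid 0.001, 0.01, 0.1, 1, 10.
lambdas = np.logspace(-3, 1, 5)

# Steps 2-3: 5-fold cross-validation for each value, best mean score wins.
search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": lambdas},   # scikit-learn names lambda "alpha"
    cv=5,
)

# Step 4: with the default refit=True, the best model is re-fit on all of X_train.
search.fit(X_train, y_train)
best_lambda = search.best_params_["ridge__alpha"]

# Step 5: a single evaluation on the held-out test set (R^2 by default).
test_score = search.score(X_test, y_test)

print("best lambda:", best_lambda)
print("test R^2:", round(test_score, 3))
```

Putting the scaler inside the pipeline means it is re-fit on each training fold, so the validation folds stay unseen.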

In real projects, follow this order:

Clean the data first (Part 2).
Engineer 10–20 candidate features from domain knowledge.
Run a filter pass — drop constants, low-variance, highly-correlated pairs.
Fit a Lasso or a tree to get embedded importances.
Drop features below a sensible threshold (for example, zero importance for trees or zero weight for Lasso).
Cross-validate the model on the reduced feature set.
Iterate. New domain ideas → new features → re-rank.

Embedded methods inside a pipeline + cross-validation = selection re-fit on each training fold, so validation data never leaks into it.
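
A minimal sketch of that setup: the Lasso-based selector is a pipeline step, so `cross_val_score` re-runs the selection inside every training fold. The synthetic data, step names, and `alpha=1.0` are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=400, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Embedded selection: keep features whose Lasso weight is not (near) zero.
    ("select", SelectFromModel(Lasso(alpha=1.0))),
    ("model", LinearRegression()),
])

# Each fold fits scale -> select -> model on its own training part only.
scores = cross_val_score(pipe, X, y, cv=5)
print("fold R^2:", scores.round(3))
print("mean R^2:", round(scores.mean(), 3))
```

Selecting features on the full dataset first and cross-validating afterwards would let every validation fold influence the selection, which inflates the scores.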

Resist the urge to be aggressive — sometimes a feature looks weak alone but matters in combination.

Tree models handle redundant features fine. Don't pre-prune aggressively.
Interactions — two weak features can be strong together. Filter methods miss this.
Domain-critical features — keep them even if statistically borderline.
Cost-of-collection features — if it's free to keep, keep it.
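
The interaction point can be illustrated with a toy case where each feature alone has almost no correlation with the target but their product does. The data is synthetic and the setup is deliberately extreme.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
# Target depends only on whether the two features share a sign.
y = (x1 * x2 > 0).astype(float)


def abs_corr(feature):
    """Absolute Pearson correlation with the target (a simple filter score)."""
    return abs(np.corrcoef(feature, y)[0, 1])


score_x1 = abs_corr(x1)          # looks useless on its own
score_x2 = abs_corr(x2)          # looks useless on its own
score_both = abs_corr(x1 * x2)   # the combination carries the signal

print(f"x1 alone: {score_x1:.3f}")
print(f"x2 alone: {score_x2:.3f}")
print(f"x1 * x2:  {score_both:.3f}")
```

A univariate filter would discard both `x1` and `x2` here, even though together they determine the target.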

Drop confidently when: the feature is constant, leaky, derived from the target, or generates more noise than signal in CV.

Part 3 · Feature Engineering — Cheat Sheet

What makes a 'good' feature

The iterative loop

Feature creation patterns

Selection: 3 families

Regularisation — Ridge vs Lasso

Parameters vs hyperparameters

The practical strategy

When NOT to drop features