Part 3 · Feature Engineering — Cheat Sheet

Picking the features that actually matter. Filter, wrapper, embedded methods and the regularisation maths behind Ridge / Lasso / Elastic Net.

1

What makes a 'good' feature

Three properties — a good feature is all three at once:

  • Informative — actually correlates with the target. The feature carries signal.
  • Discriminating — varies enough to separate classes / predict the target.
  • Independent — adds something the other features don't already give you.

Bad features:

  • Constant column → no information.
  • Perfectly correlated with another → redundant.
  • High-cardinality unique IDs → memorisation risk.
  • Leaky features that contain the target → too good to be true.
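
The four checks above can be sketched in plain Python. This is a rough illustration, not a standard routine: the helper names `pearson` and `flag_bad_features`, the 0.999 correlation limit, and the "all values distinct and non-float" test for IDs are all assumptions made for this example, and it expects numeric columns apart from the ID column.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def flag_bad_features(columns, target, corr_limit=0.999):
    """Return {feature_name: reason} for columns that match one of the checks."""
    flags = {}
    kept = []  # features that passed every check so far
    n_rows = len(target)
    for name, values in columns.items():
        distinct = set(values)
        has_float = any(isinstance(v, float) for v in values)
        if len(distinct) == 1:
            flags[name] = "constant"
        elif len(distinct) == n_rows and not has_float:
            # Heuristic (assumption): every value distinct and not continuous.
            flags[name] = "unique id"
        elif abs(pearson(values, target)) >= corr_limit:
            flags[name] = "possible leak"
        else:
            twin = next(
                (k for k in kept
                 if abs(pearson(values, columns[k])) >= corr_limit),
                None,
            )
            if twin is not None:
                flags[name] = f"redundant with {twin}"
            else:
                kept.append(name)
    return flags

# Made-up toy data for illustration only.
target = [0, 0, 1, 1, 1, 0]
heights = [150.0, 160.0, 170.0, 180.0, 190.0, 175.0]
columns = {
    "row_id": [101, 102, 103, 104, 105, 106],
    "country": [1, 1, 1, 1, 1, 1],
    "height_cm": heights,
    "height_in": [h / 2.54 for h in heights],
    "label_copy": list(target),
    "age": [23, 35, 35, 41, 52, 23],
}
flags = flag_bad_features(columns, target)
```

A perfect correlation with the target is only a hint of leakage, so a flagged column still deserves a manual look at where it came from.
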
2

The iterative loop

Feature engineering is not a one-shot transformation. It's a loop:

  1. Brainstorm features from domain knowledge.
  2. Decide which to create.
  3. Implement transformations.
  4. Train a model with them.
  5. Evaluate — did they help? did they leak?
  6. Repeat with new ideas / removals.

Each iteration narrows down to a feature set that genuinely helps the model. Domain knowledge is the engine of step 1 — and it's irreplaceable.

3

Feature creation patterns

Common recipes:

| Pattern | Example |
| --- | --- |
| Datetime extraction | date → weekday, month, hour, is_weekend |
| Numeric ratios | height + weight → BMI |
| Binning | age → young / mid / senior |
| Log transform | price → log(price) to fix skew |
| Polynomial | x → x², x³ (plus interaction terms such as x₁·x₂) |
| Aggregations | customer_id → avg_order, n_purchases |
| Encoding categoricals | One-hot, ordinal, target (see Part 2) |
| Text | TF-IDF, embeddings, char n-grams |

Each new feature is a hypothesis about what helps prediction. Test it.
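
A few of these recipes can be sketched for a single record in plain Python. The field names (`timestamp`, `height_cm`, `weight_kg`, `age`, `price`), the bin edges at 30 and 60, and the helper name `engineer` are assumptions made for this example.

```python
import math
from datetime import datetime

def engineer(row):
    """Derive example features from one raw record (a dict)."""
    ts = datetime.fromisoformat(row["timestamp"])
    height_m = row["height_cm"] / 100
    age = row["age"]
    return {
        # Datetime extraction
        "weekday": ts.weekday(),            # Monday = 0 ... Sunday = 6
        "month": ts.month,
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,
        # Numeric ratio: BMI = weight (kg) / height (m) squared
        "bmi": row["weight_kg"] / height_m ** 2,
        # Binning (edges are an assumption)
        "age_bin": "young" if age < 30 else "mid" if age < 60 else "senior",
        # Log transform; assumes price > 0
        "log_price": math.log(row["price"]),
    }

# Made-up record for illustration only.
row = {
    "timestamp": "2024-03-16T14:30:00",
    "height_cm": 180.0,
    "weight_kg": 81.0,
    "age": 42,
    "price": 100.0,
}
features = engineer(row)
```
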

4

Selection: 3 families

| Family | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Filter | Score each feature in isolation with a stat (correlation, chi², ANOVA, mutual info). | Fast. Model-agnostic. Cheap to run. | Ignores interactions. Univariate blindness. |
| Wrapper | Wrap a model; add/drop features greedily (RFE, forward/backward). | Captures interactions. | Slow: fits the model many times. Overfit-prone. |
| Embedded | Selection happens during training (Lasso, tree importances). | Best balance of speed + accuracy. | Coupled to the chosen model. |

Default: start with filter for sanity, then embedded (Lasso or tree importances) for the real selection.
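
As a minimal sketch of a filter pass, the snippet below ranks features by the absolute Pearson correlation with the target, scoring each feature in isolation. The helper names `pearson` and `filter_rank` and the toy data are assumptions for illustration; chi², ANOVA or mutual information would slot into the same place.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def filter_rank(columns, target):
    """Score every feature on its own and sort from strongest to weakest."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up toy data for illustration only.
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
columns = {
    "noise": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "weak": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
}
ranking = filter_rank(columns, target)
ranked_names = [name for name, _ in ranking]
```

Because each score looks at one feature at a time, this ranking says nothing about features that only matter in combination.
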

5

Regularisation — Ridge vs Lasso

Many embedded methods are regularised regressions. They add a penalty on large weights to the loss, which curbs overfitting:

| Method | Penalty | Effect |
| --- | --- | --- |
| Ridge (L2) | $\lambda \sum_j w_j^2$ | Shrinks weights toward 0, never to 0. Keeps all features. |
| Lasso (L1) | $\lambda \sum_j \lvert w_j \rvert$ | Shrinks weights and zeroes some out. Acts as feature selector. |
| Elastic Net | $\alpha \lVert w \rVert_1 + (1-\alpha) \lVert w \rVert_2^2$ | Best of both. Use when features are correlated. |

Why Lasso zeroes coefficients but Ridge doesn't: the L1 penalty's diamond shape has corners on the axes. The loss contours typically first touch the diamond at one of those corners, where some weights are exactly 0. The L2 penalty's circle has no corners, so the touching point almost never sits on an axis.
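
The same contrast shows up in a one-weight toy problem. The sketch below assumes a single weight whose unpenalised least-squares value is `z` and a loss scaled to 0.5·(w − z)², which is a simplification (it holds, for example, when features are orthonormal). The function names are made up for this example.

```python
def ridge_weight(z, lam):
    """Minimiser of 0.5 * (w - z)**2 + lam * w**2: rescales z, never exactly 0."""
    return z / (1 + 2 * lam)

def lasso_weight(z, lam):
    """Minimiser of 0.5 * (w - z)**2 + lam * abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # any |z| <= lam is set exactly to zero

lam = 0.5
for z in (2.0, 0.3, -0.4):
    print(f"z={z:+.1f}  ridge={ridge_weight(z, lam):+.3f}  lasso={lasso_weight(z, lam):+.3f}")
```

Ridge divides every weight by the same factor, so a nonzero `z` stays nonzero. Lasso subtracts a fixed amount and clips at zero, which is where the selection effect comes from.
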

6

Parameters vs hyperparameters

Confusing the two is a classic interview trap.

| | Parameter | Hyperparameter |
| --- | --- | --- |
| What | Internal value the model learns | Setting you choose before training |
| Examples | Weights w, biases b, tree splits | λ in Ridge, k in KNN, max_depth in trees, learning rate in NN |
| Tuned by | Optimisation on training data | Cross-validation on validation data |

How to find optimal λ?

  1. Pick a grid of λ values (often log-spaced: 0.001, 0.01, 0.1, 1, 10).
  2. For each λ, cross-validate the model.
  3. Pick the λ with the best mean validation score.
  4. Re-fit on the full train set with that λ.
  5. Evaluate once on the held-out test set.

Tools: GridSearchCV, RandomizedSearchCV, Optuna.
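
A minimal sketch of the five steps with scikit-learn's `GridSearchCV`, assuming a Ridge model on synthetic data from `make_regression`. Note that scikit-learn calls λ `alpha`; the dataset sizes and random seeds are arbitrary choices for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption: stands in for a real, already-cleaned dataset).
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: log-spaced grid of lambda values.
lambda_grid = [0.001, 0.01, 0.1, 1, 10]

# Steps 2-4: cross-validate each value, pick the best mean score, and
# (refit=True is the default) re-fit on the full training set.
search = GridSearchCV(Ridge(), param_grid={"alpha": lambda_grid}, cv=5)
search.fit(X_train, y_train)
best_lambda = search.best_params_["alpha"]

# Step 5: a single evaluation on the held-out test set (R^2 for regressors).
test_r2 = search.score(X_test, y_test)
```
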

7

The practical strategy

In real projects, follow this order:

  1. Clean the data first (Part 2).
  2. Engineer 10–20 candidate features from domain knowledge.
  3. Run a filter pass — drop constants, low-variance, highly-correlated pairs.
  4. Fit a Lasso or a tree to get embedded importances.
  5. Keep only features that clear a sensible threshold (importance > 0 for trees, weight ≠ 0 for Lasso); drop the rest.
  6. Cross-validate the model on the reduced feature set.
  7. Iterate. New domain ideas → new features → re-rank.

Running embedded selection inside a pipeline under cross-validation keeps it leakage-safe: the selector is re-fit on each training fold and never sees the validation fold.
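
One way this could look in scikit-learn is sketched below: a filter step, a Lasso-based selector and the final model share one `Pipeline`, so `cross_val_score` re-fits the whole chain on each training fold. The synthetic data, the step names, and `alpha=1.0` are assumptions for this example, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which carry signal.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

pipe = Pipeline([
    ("drop_constant", VarianceThreshold()),         # filter pass: zero-variance columns
    ("scale", StandardScaler()),                    # Lasso is sensitive to feature scale
    ("select", SelectFromModel(Lasso(alpha=1.0))),  # keep features with nonzero weight
    ("model", LinearRegression()),                  # final model on the reduced set
])

# Every fold re-fits selection and model on that fold's training part only.
scores = cross_val_score(pipe, X, y, cv=5)

# Fit once on all data to inspect how many features survive selection.
pipe.fit(X, y)
n_selected = int(pipe.named_steps["select"].get_support().sum())
```
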

8

When NOT to drop features

Resist the urge to be aggressive — sometimes a feature looks weak alone but matters in combination.

  • Tree models handle redundant features fine. Don't pre-prune aggressively.
  • Interactions — two weak features can be strong together. Filter methods miss this.
  • Domain-critical features — keep them even if statistically borderline.
  • Cost-of-collection features — if it's free to keep, keep it.
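
The interaction point can be made concrete with a tiny constructed example: two features that each have zero correlation with the target, while their product predicts it perfectly. The data is artificial and the helper `pearson` is written out here only to keep the snippet self-contained.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

# Target is 1 when a and b share a sign, else 0 (an XOR-style pattern).
a = [-1, -1, 1, 1] * 5
b = [-1, 1, -1, 1] * 5
y = [1 if ai * bi > 0 else 0 for ai, bi in zip(a, b)]

corr_a = pearson(a, y)            # each feature alone looks useless
corr_b = pearson(b, y)
product = [ai * bi for ai, bi in zip(a, b)]
corr_product = pearson(product, y)  # the interaction carries all the signal
```

A univariate filter would discard both `a` and `b` here, which is why a filter pass is best limited to clear-cut cases such as constants and duplicates.
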

Drop confidently when: the feature is constant, leaky, derived from the target, or generates more noise than signal in CV.