What makes a 'good' feature
Three properties; a good feature has all three at once:
- Informative: correlates with the target, so the feature carries signal.
- Discriminating: varies enough to separate classes or predict the target.
- Independent: adds something the other features don't already give you.
Bad features:
- Constant column → no information.
- Perfectly correlated with another feature → redundant.
- High-cardinality unique IDs → memorisation risk.
- Leaky features that contain the target → too good to be true.
The iterative loop
Feature engineering is not a one-shot transformation. It's a loop:
- Brainstorm features from domain knowledge.
- Decide which to create.
- Implement transformations.
- Train a model with them.
- Evaluate: did they help? Did they leak?
- Repeat with new ideas / removals.
Each iteration narrows down to a feature set that genuinely helps the model. Domain knowledge drives the brainstorming step, and it's irreplaceable.
Feature creation patterns
Common recipes:
| Pattern | Example |
|---|---|
| Datetime extraction | date → weekday, month, hour, is_weekend |
| Numeric ratios | height + weight → BMI |
| Binning | age → young / mid / senior |
| Log transform | price → log(price) to fix skew |
| Polynomial / interaction | x → x², x³; x₁, x₂ → x₁·x₂ (interaction term) |
| Aggregations | customer_id → avg_order, n_purchases |
| Encoding categoricals | One-hot, ordinal, target (see Part 2) |
| Text | TF-IDF, embeddings, char n-grams |
Each new feature is a hypothesis about what helps prediction. Test it.
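A few of these recipes can be sketched with pandas. This is a minimal illustration, not a reference implementation: the toy data, the column names (`order_ts`, `height_m`, and so on) and the age cut points are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every column name and value here is illustrative.
df = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "order_ts": pd.to_datetime(["2024-01-06 09:30", "2024-01-08 18:05", "2024-01-09 23:45"]),
    "height_m": [1.60, 1.75, 1.82],
    "weight_kg": [55.0, 70.0, 95.0],
    "age": [23, 41, 67],
    "price": [10.0, 100.0, 1000.0],
})

# Datetime extraction: date -> weekday, hour, is_weekend
df["weekday"] = df["order_ts"].dt.dayofweek      # Monday=0 ... Sunday=6
df["hour"] = df["order_ts"].dt.hour
df["is_weekend"] = df["weekday"] >= 5

# Numeric ratio: height + weight -> BMI
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Binning: age -> young / mid / senior (cut points are an assumption)
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                        labels=["young", "mid", "senior"])

# Log transform: compress a right-skewed price column
df["log_price"] = np.log(df["price"])

# Aggregations: customer_id -> avg_order, n_purchases
df["avg_order"] = df.groupby("customer_id")["price"].transform("mean")
df["n_purchases"] = df.groupby("customer_id")["price"].transform("count")
```

Each derived column is still only a candidate: whether it helps is decided by evaluating a model with and without it.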
Selection: 3 families
| Family | How it works | Pros | Cons |
|---|---|---|---|
| Filter | Score each feature in isolation with a stat (correlation, chi², ANOVA, mutual info). | Fast. Model-agnostic. Cheap to run. | Univariate: ignores interactions between features. |
| Wrapper | Wrap a model; add/drop features greedily (RFE, forward/backward). | Captures interactions. | Slow: fits the model many times. Overfit-prone. |
| Embedded | Selection happens during training (Lasso, tree importances). | Usually a good balance of speed and accuracy. | Coupled to the chosen model. |
Default: start with filter for sanity, then embedded (Lasso or tree importances) for the real selection.
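To make the three families concrete, here is a rough scikit-learn sketch on synthetic data. The dataset, the choice of `k=3`, and the specific estimators are assumptions for illustration; the three methods need not agree on which features to keep.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LassoCV, LinearRegression

# Synthetic data (an assumption): 10 features, only 3 carry signal.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Filter: score each feature on its own with a univariate F-test.
filt = SelectKBest(score_func=f_regression, k=3).fit(X, y)
filter_keep = filt.get_support()

# Wrapper: recursive feature elimination, refitting the model each round.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
wrapper_keep = rfe.get_support()

# Embedded: Lasso selects during training by zeroing some weights.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
embedded_keep = lasso.coef_ != 0

print("filter  :", filter_keep.nonzero()[0])
print("wrapper :", wrapper_keep.nonzero()[0])
print("embedded:", embedded_keep.nonzero()[0])
```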
Regularisation — Ridge vs Lasso
Many embedded methods are regularised regressions. They add a penalty on the size of the weights to the training loss, which discourages large weights and curbs overfitting:
| Method | Penalty | Effect |
|---|---|---|
| Ridge (L2) | λ‖w‖₂² | Shrinks weights toward 0, but not exactly to 0. Keeps all features. |
| Lasso (L1) | λ‖w‖₁ | Shrinks weights and zeroes some out. Acts as feature selector. |
| Elastic Net | λ₁‖w‖₁ + λ₂‖w‖₂² | Mixes both penalties. Use when features are correlated. |
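Why Lasso zeroes coefficients but Ridge doesn't: picture each penalty as a constraint region around the origin. The L1 region is a diamond whose corners sit on the axes, and the loss contours typically first touch it at a corner, where some weights are exactly 0. The L2 region is a circle with no corners, so the touching point almost never lies exactly on an axis.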
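The difference is easy to see empirically. The sketch below assumes synthetic data in which only the first two of eight features matter, and the penalty strengths are arbitrary; note that scikit-learn calls λ `alpha`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Assumed ground truth: only features 0 and 1 influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros, ridge vs lasso:", n_zero_ridge, n_zero_lasso)
```

Ridge leaves small non-zero weights on the noise features, while Lasso sets weights to exactly zero and shrinks the two real ones.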
Parameters vs hyperparameters
Confusing the two is a classic interview trap.
| | Parameter | Hyperparameter |
|---|---|---|
| What | Internal value the model learns | Setting you choose before training |
| Examples | Weights w, biases b, tree splits | λ in Ridge, k in KNN, max_depth in trees, learning rate in NN |
| Tuned by | Optimisation on training data | Cross-validation on validation data |
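In scikit-learn the split is visible in the API: hyperparameters are constructor arguments, and learned parameters appear as attributes with a trailing underscore after `fit`. A tiny sketch on made-up data (the points follow y = 2x + 1):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data, assumed for illustration: y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Hyperparameter: alpha (the λ of Ridge) is chosen before training.
model = Ridge(alpha=0.1)
model.fit(X, y)

# Parameters: the weight and bias are learned from the training data.
weight = model.coef_[0]
bias = model.intercept_

print("hyperparameter alpha:", model.get_params()["alpha"])
print("learned weight and bias:", weight, bias)
```

The learned weight comes out slightly below 2 because the L2 penalty shrinks it.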
How to find optimal λ?
- Pick a grid of λ values (often log-spaced: 0.001, 0.01, 0.1, 1, 10).
- For each λ, cross-validate the model.
- Pick the λ with the best mean validation score.
- Re-fit on the full train set with that λ.
- Evaluate once on the held-out test set.
Tools: GridSearchCV, RandomizedSearchCV, Optuna.
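The steps above can be sketched with `GridSearchCV`. The synthetic data, the 80/20 split, and the R² scoring are assumptions for the example; scikit-learn names λ `alpha`, and `GridSearchCV` re-fits the best setting on the full training set by default (`refit=True`).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption for the example).
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Log-spaced grid of candidate λ values.
lambdas = [0.001, 0.01, 0.1, 1, 10]

# 5-fold cross-validation for every λ, then re-fit the best on all of X_train.
search = GridSearchCV(Ridge(), param_grid={"alpha": lambdas}, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_lambda = search.best_params_["alpha"]
test_r2 = search.score(X_test, y_test)   # single evaluation on the held-out set

print("best λ:", best_lambda)
print("mean CV R² for best λ:", round(search.best_score_, 3))
print("test R²:", round(test_r2, 3))
```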
The practical strategy
In real projects, follow this order:
- Clean the data first (Part 2).
- Engineer 10–20 candidate features from domain knowledge.
- Run a filter pass — drop constants, low-variance, highly-correlated pairs.
- Fit a Lasso or a tree to get embedded importances.
- Keep features that clear a sensible threshold (importance > 0 for trees, weight ≠ 0 for Lasso) and drop the rest.
- Cross-validate the model on the reduced feature set.
- Iterate. New domain ideas → new features → re-rank.
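Put the embedded selector inside a pipeline and cross-validate the whole pipeline: selection is then re-fit on each training fold and never sees the validation fold, which guards against selection leakage.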
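A minimal sketch of that setup follows, assuming synthetic regression data and an arbitrary Lasso strength. Because every step lives inside the `Pipeline`, `cross_val_score` re-fits the filter, scaler, selector, and model on each training fold.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): 25 features, 5 informative.
X, y = make_regression(n_samples=300, n_features=25, n_informative=5,
                       noise=10.0, random_state=0)

pipe = Pipeline([
    ("drop_constant", VarianceThreshold(threshold=0.0)),  # filter pass: constants
    ("scale", StandardScaler()),                          # put weights on one scale
    ("select", SelectFromModel(Lasso(alpha=1.0))),        # embedded: keep non-zero weights
    ("model", LinearRegression()),                        # final model on kept features
])

# Each fold fits the full pipeline on its training part only.
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")

# Fit once on all data just to inspect how many features survive selection.
pipe.fit(X, y)
n_kept = int(pipe.named_steps["select"].get_support().sum())

print("CV R² per fold:", scores.round(3))
print("features kept:", n_kept, "of", X.shape[1])
```

Running the selector on the full dataset before cross-validating would let information from the validation folds influence which features are kept.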
When NOT to drop features
Resist the urge to be aggressive — sometimes a feature looks weak alone but matters in combination.
- Tree models handle redundant features fine. Don't pre-prune aggressively.
- Interactions — two weak features can be strong together. Filter methods miss this.
- Domain-critical features — keep them even if statistically borderline.
- Cost-of-collection features — if it's free to keep, keep it.
Drop confidently when: the feature is constant, leaky, derived from the target, or generates more noise than signal in CV.
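The interaction point can be shown with a deliberately extreme case: an XOR target, where each feature alone is useless and the pair together is perfect. The data and the choice of a random forest are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))   # two independent binary features
y = X[:, 0] ^ X[:, 1]                    # target is their XOR

# Filter view: absolute correlation of each feature with the target.
corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(2)]

# Model view: one feature alone versus both together.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
acc_single = cross_val_score(forest, X[:, [0]], y, cv=5).mean()
acc_pair = cross_val_score(forest, X, y, cv=5).mean()

print("per-feature |correlation| with target:", np.round(corr, 3))
print("accuracy with feature 0 only:", round(acc_single, 3))
print("accuracy with both features:", round(acc_pair, 3))
```

A univariate filter would score both features near zero and drop them, even though together they determine the target.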