Maria Aguilera

	Bagging	Boosting
Direction	Parallel — trees independent	Sequential — each tree fixes prior errors
Goal	Reduce variance	Reduce bias
Aggregation	Average / vote	Weighted sum
Example	Random Forest	XGBoost, LightGBM, CatBoost
Overfit risk	Low	Higher — needs early stopping
Out-of-the-box	Almost always works	Needs tuning

The Random Forest recipe:

Sample N rows with replacement (bootstrap) from training data — different sample per tree.
At each split, only consider a random subset of features (typically sqrt(d) for classification).
Grow deep trees with no pruning.
Average predictions across all trees (majority vote for classification).
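
The recipe above maps fairly directly onto scikit-learn's `RandomForestClassifier`. A minimal sketch, assuming scikit-learn is installed; the synthetic dataset, the tree count, and the out-of-bag check are illustrative additions rather than part of the cheat sheet:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,      # number of trees to aggregate
    bootstrap=True,        # each tree sees a bootstrap sample of the rows
    max_features="sqrt",   # sqrt(d) candidate features at each split
    max_depth=None,        # grow trees fully, no pruning
    oob_score=True,        # score each row using only trees that did not see it
    random_state=0,
)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_
test_accuracy = rf.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The out-of-bag score comes for free from the bootstrap step: rows left out of a tree's sample act as a small validation set for that tree.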

Why it works:

Each tree overfits differently. Averaging cancels the noise.
Feature subsetting de-correlates the trees, so averaging cancels more of the variance.
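
The de-correlation argument can be made concrete with the standard formula for the variance of an average of B identically distributed predictions, each with variance sigma2 and pairwise correlation rho: rho * sigma2 + (1 - rho) * sigma2 / B. A small sketch; the function name and the numbers are illustrative only:

```python
def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Variance of the mean of n_trees predictions, each with variance sigma2
    and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# Same number of trees, different correlation between them
for rho in (0.9, 0.5, 0.1):
    print(rho, round(ensemble_variance(rho, sigma2=1.0, n_trees=500), 4))
```

The first term does not shrink as trees are added, so past a certain forest size only lowering the correlation between trees helps.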

Knobs:

The boosting recipe:

Each tree corrects what the previous ones got wrong.
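
To see what "correcting what the previous ones got wrong" can look like, here is a toy sketch of gradient boosting for squared loss, where each new learner is fitted to the current residuals. It uses one-split stumps on a single feature in pure Python; the helper names (`fit_stump`, `boost`, `mse`) and the toy data are hypothetical, and real libraries are far more elaborate:

```python
def fit_stump(x, residuals):
    """Find the 1-D threshold split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to what the ensemble still gets wrong."""
    base = sum(y) / len(y)                  # start from the mean prediction
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


x = [i / 10 for i in range(50)]
y = [xi ** 2 for xi in x]                   # toy target, illustrative only

mse_5 = mse(boost(x, y, n_rounds=5), x, y)
mse_100 = mse(boost(x, y, n_rounds=100), x, y)
print(f"train MSE after 5 rounds:   {mse_5:.3f}")
print(f"train MSE after 100 rounds: {mse_100:.3f}")
```

Training error keeps falling as rounds are added, which is exactly why a validation set is needed to decide when to stop.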

Critical knobs:

Knob	Effect
`learning_rate`	Smaller = more trees needed, more stable. `0.05–0.1` typical.
`n_estimators`	Stop early — use `early_stopping_rounds` on a validation set.
`max_depth`	4–8 is typical. Shallower than RF.
`subsample`	Row subsampling per tree (stochastic boosting). Adds randomness, which helps prevent overfit.
`colsample_bytree`	Like Random Forest's feature subsetting.

	XGBoost	LightGBM	CatBoost
Tree growth	Level-wise (breadth-first)	Leaf-wise (best-first, largest loss reduction)	Symmetric / oblivious
Speed	Solid	Fastest on large data	Slower train, fast inference
Memory	OK	Smallest footprint	OK
Categoricals	Basic	Indexed	Native ordered target encoding
Sample efficiency	High	High	Best
Overfit risk	Lowest	Highest (leaf-wise)	Low

Default picks:

Missing values:

XGBoost, LightGBM: learn the best default direction at each split for missing values. No imputation needed.
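CatBoost: handles missing numeric values natively; by default they are treated as smaller than every observed value (nan_mode="Min").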
scikit-learn HistGradientBoosting: native support too.

Categoricals:

CatBoost: native ordered target encoding — handles high-cardinality without leakage.
LightGBM: histogram-based categorical splits, pass categorical_feature=[col_indices].
XGBoost: recent versions support categoricals natively (enable_categorical=True); older code needed one-hot.
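
To illustrate why an ordered target encoding avoids leakage, here is a simplified pure-Python sketch: each row is encoded using only the targets of earlier rows in the same category, smoothed towards a prior. This is an illustration of the idea, not CatBoost's actual implementation (which, among other things, uses random permutations of the rows); the function name, `prior_weight`, and the toy data are assumptions:

```python
def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each row from the targets of EARLIER rows in the same category."""
    sums, counts = {}, {}
    encoded = []
    for cat, target in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of previously seen targets; the row's own target is excluded
        encoded.append((s + prior_weight * prior) / (n + prior_weight))
        # Only now does this row's target become visible to later rows
        sums[cat] = s + target
        counts[cat] = n + 1
    return encoded


cities = ["A", "B", "A", "A", "B"]
clicked = [1, 0, 1, 0, 1]
prior = sum(clicked) / len(clicked)   # global mean of the target

encoded = ordered_target_encode(cities, clicked, prior)
print(encoded)
```

Because a row never contributes to its own encoding, the encoded feature cannot simply memorise that row's label, which is the failure mode of a naive per-category mean.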

Boosting will happily memorise the training data if you let it. Two safeguards:

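Early stopping: halt training once the validation metric stops improving.

from xgboost import XGBClassifier  # assuming an XGBoost classifier

# XGBoost >= 2.0: early_stopping_rounds goes in the constructor, not fit().
# LightGBM >= 4.0: pass callbacks=[lightgbm.early_stopping(50)] to fit() instead.
model = XGBClassifier(early_stopping_rounds=50)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
)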

Regularisation — min_child_samples, reg_alpha (L1), reg_lambda (L2), max_depth.

Rule: lower learning_rate + more rounds + early stopping ≫ higher learning_rate + fixed rounds.

Situation	Pick
Robust default, minimal tuning	Random Forest
Need probability calibration	RF (boosted probabilities often need post-hoc calibration)
Want max leaderboard score	XGBoost or LightGBM
Lots of categorical features	CatBoost
Massive dataset, low memory	LightGBM
Time-series with lag features	XGBoost / LightGBM with rolling-origin CV
Need interpretability	Single tree or RF with SHAP
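
For the time-series row, a rolling-origin split can be sketched in a few lines of pure Python: the training window expands and each test block lies strictly after it. The function name and parameters are hypothetical; scikit-learn's `TimeSeriesSplit` offers similar behaviour:

```python
def rolling_origin_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) with an expanding training window.

    Every test block comes strictly after its training data, so lag features
    are never evaluated on rows that precede the training rows.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(rolling_origin_splits(n_samples=12, n_folds=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx}")
```

Rows are assumed to be sorted by time before splitting; shuffled K-fold would let the model train on the future.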

Boosting models offer two flavours:

model.feature_importances_   # XGBoost default = gain; LightGBM default = split count

For honest importance under correlated features: SHAP values.

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

Part 8 · Random Forest & Boosting — Cheat Sheet