| | Bagging | Boosting |
|---|---|---|
| Direction | Parallel — trees independent | Sequential — each tree fixes prior errors |
| Goal | Reduce variance | Reduce bias |
| Aggregation | Average / vote | Weighted sum |
| Example | Random Forest | XGBoost, LightGBM, CatBoost |
| Overfit risk | Low | Higher — needs early stopping |
| Out-of-the-box | Almost always works | Needs tuning |
Bagging vs Boosting
Random Forest
The recipe:
- Sample N rows with replacement (bootstrap) from training data — different sample per tree.
- At each split, only consider a random subset of features (typically sqrt(d) for classification).
- Grow deep trees with no pruning.
- Average predictions across all trees (majority vote for classification).
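The two sampling steps in the recipe can be sketched with the standard library alone. This is an illustration, not how any particular library implements it; the helper names `bootstrap_indices` and `feature_subset` are hypothetical.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def feature_subset(n_features, rng):
    """Pick about sqrt(d) distinct feature indices to consider at one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_rows, n_features = 1000, 16

sample = bootstrap_indices(n_rows, rng)          # rows used by one tree
out_of_bag = set(range(n_rows)) - set(sample)    # rows this tree never sees
candidates = feature_subset(n_features, rng)     # features tried at one split

print(len(sample), len(out_of_bag) / n_rows, candidates)
```

Because sampling is with replacement, roughly a third of the rows are left out of each tree's sample, which is what makes each tree see a different view of the data.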
Why it works:
- Each tree overfits differently. Averaging cancels the noise.
- Feature subsetting de-correlates the trees, so averaging removes more variance.
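The effect of de-correlation can be made concrete with the standard formula for the variance of an average of equally correlated estimators. The sketch below assumes every tree has the same variance and the same pairwise correlation, which real forests only approximate; the function name is hypothetical.

```python
def ensemble_variance(tree_variance, correlation, n_trees):
    """Variance of the mean of n_trees estimators that share one pairwise correlation.

    Var = rho * sigma^2 + (1 - rho) * sigma^2 / B
    The first term does not shrink as trees are added; only the second does.
    """
    return correlation * tree_variance + (1.0 - correlation) * tree_variance / n_trees


high_corr = ensemble_variance(tree_variance=1.0, correlation=0.9, n_trees=500)
low_corr = ensemble_variance(tree_variance=1.0, correlation=0.3, n_trees=500)
print(high_corr, low_corr)
```

With highly correlated trees, adding more of them barely helps; lowering the correlation lowers the floor that averaging can reach.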
Knobs:
- n_estimators (more = smoother, slower)
- max_features (lower = more diversity)
- max_depth (rarely worth limiting)
Gradient Boosting
The recipe:
- Start with a constant prediction (mean target).
- Compute the residuals (errors); for losses other than squared error, use the negative gradient of the loss.
- Fit a small tree to predict the residuals.
- Add this tree's prediction (× learning rate) to the running prediction.
- Repeat.
Each tree corrects what the previous ones got wrong.
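The loop above can be sketched in plain Python for squared-error regression on a single feature, using one-split stumps as the "small trees". This is a teaching sketch under those assumptions, not how XGBoost or LightGBM are implemented; `fit_stump` and `boost` are hypothetical names.

```python
from statistics import mean


def fit_stump(x, residuals):
    """Fit a one-split tree to the residuals; needs at least two distinct x values."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda xi: lv if xi <= t else rv


def boost(x, y, n_rounds=50, learning_rate=0.1):
    base = mean(y)                       # step 1: constant prediction
    pred = [base] * len(y)
    trees = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]   # step 2: errors so far
        tree = fit_stump(x, residuals)                      # step 3: small tree on residuals
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for pi, xi in zip(pred, x)]  # step 4

    def predict(xi):
        return base + learning_rate * sum(tree(xi) for tree in trees)

    return predict


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
model = boost(x, y)

mse = mean((model(xi) - yi) ** 2 for xi, yi in zip(x, y))
baseline_mse = mean((mean(y) - yi) ** 2 for yi in y)
print(mse, baseline_mse)
```

The training error of the boosted model falls below that of the constant baseline because every round removes part of the remaining residual.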
Critical knobs:
| Knob | Effect |
|---|---|
| learning_rate | Smaller = more trees needed, more stable. 0.05–0.1 typical. |
| n_estimators | Stop early: use early_stopping_rounds on a validation set. |
| max_depth | 4–8 is typical. Shallower than RF. |
| subsample | Row sampling per tree (stochastic boosting). Adds randomness, which helps prevent overfitting. |
| colsample_bytree | Like Random Forest's feature subsetting, applied per tree rather than per split. |
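As a starting point, the knobs in the table could be collected into one parameter dictionary. The values below are illustrative picks from the ranges given above, not tuned recommendations, and the name `boosting_params` is hypothetical.

```python
# Hypothetical starting configuration using the parameter names from the table.
boosting_params = {
    "learning_rate": 0.05,     # small step size, so more trees are needed
    "n_estimators": 2000,      # upper bound; early stopping should pick the real count
    "max_depth": 6,            # shallower than a typical Random Forest tree
    "subsample": 0.8,          # each tree sees 80% of the rows
    "colsample_bytree": 0.8,   # each tree sees 80% of the features
}
```

These names match the scikit-learn-style wrappers of XGBoost and LightGBM. In LightGBM, subsample is believed to take effect only when subsample_freq is also set above 0, so it is worth checking the library documentation before relying on it.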
XGBoost vs LightGBM vs CatBoost
| | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise (breadth-first) | Leaf-wise (best-first: splits the leaf with the largest loss reduction) | Symmetric / oblivious |
| Speed | Solid | Fastest on large data | Slower train, fast inference |
| Memory | OK | Smallest footprint | OK |
| Categoricals | Basic | Indexed | Native ordered target encoding |
| Sample efficiency | High | High | Best |
| Overfit risk | Lowest | Highest (leaf-wise) | Low |
Default picks:
- Big data, tight memory → LightGBM
- Lots of categoricals → CatBoost
- Maximum control & regularisation → XGBoost
Missing values & categoricals
Missing values:
- XGBoost, LightGBM: learn the best default direction at each split for missing values. No imputation needed.
- CatBoost: also handles missing numeric values natively (by default they are treated as smaller than every observed value).
- scikit-learn HistGradientBoosting: native support too.
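The idea of a learned default direction can be sketched as follows: at a given split, try sending the rows with a missing value left, then right, and keep whichever gives the lower loss. This simplified version uses squared error on raw targets; real libraries work with gradient statistics and evaluate this for every candidate split. The function names are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean; 0 for an empty group."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_default_direction(x, y, threshold):
    """Decide where rows with a missing feature (None) should go at this split."""
    left = [yi for xi, yi in zip(x, y) if xi is not None and xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi is not None and xi > threshold]
    missing = [yi for xi, yi in zip(x, y) if xi is None]

    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"


x = [1.0, 2.0, None, 8.0, 9.0, None]
y = [10.0, 11.0, 30.0, 31.0, 29.0, 32.0]
direction = best_default_direction(x, y, threshold=5.0)
print(direction)  # the missing rows have targets close to the right-hand group
```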
Categoricals:
- CatBoost: native ordered target encoding — handles high-cardinality without leakage.
- LightGBM: histogram-based categorical splits; pass categorical_feature=[col_indices].
- XGBoost: modern versions support categoricals (enable_categorical=True); older code needed one-hot.
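To show why ordered target encoding avoids leakage, here is a minimal sketch: each row's category is encoded using only the targets of rows that came before it, smoothed towards a prior. CatBoost's actual scheme is more involved (it uses random permutations, among other things); the function and its `prior_weight` parameter are hypothetical.

```python
def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each category with the smoothed mean target of EARLIER rows only."""
    sums, counts = {}, {}
    encoded = []
    for category, target in zip(categories, targets):
        s = sums.get(category, 0.0)
        n = counts.get(category, 0)
        # The current row's own target is not used, so it cannot leak into its feature.
        encoded.append((s + prior_weight * prior) / (n + prior_weight))
        sums[category] = s + target
        counts[category] = n + 1
    return encoded


categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]
encoded = ordered_target_encode(categories, targets, prior=0.5)
print(encoded)
```

The first occurrence of each category falls back to the prior, and later occurrences gradually reflect the history of that category.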
Overfit & early stopping
Boosting will happily memorise the training data if you let it. Two safeguards:
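- Early stopping: track validation loss, stop when it stops improving.

      # Older releases accept early_stopping_rounds in fit(). LightGBM 4.0+ expects
      # callbacks=[lightgbm.early_stopping(50)] and newer XGBoost expects
      # early_stopping_rounds in the model constructor.
      model.fit(
          X_train, y_train,
          eval_set=[(X_val, y_val)],
          early_stopping_rounds=50,
      )

- Regularisation: min_child_samples, reg_alpha (L1), reg_lambda (L2), max_depth.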
Rule: lower learning_rate + more rounds + early stopping ≫ higher learning_rate + fixed rounds.
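The early-stopping logic itself is simple enough to sketch without any library: keep the best validation loss seen so far and stop once a set number of rounds pass without improvement. The function name and the loss values are made up for illustration.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, best_loss), stopping after `patience` rounds without improvement."""
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round, best_loss


# Validation loss per boosting round (illustrative numbers).
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
best_round, best_loss = early_stopping_round(val_losses, patience=3)
print(best_round, best_loss)
```

Note the trade-off in choosing patience: here training stops before the later improvement at the final round is ever seen, which is why a lower learning rate usually goes together with a more generous patience.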
When to choose each
| Situation | Pick |
|---|---|
| Robust default, minimal tuning | Random Forest |
| Need probability calibration | RF (boosted probabilities often need post-hoc calibration, e.g. isotonic regression) |
| Want max leaderboard score | XGBoost or LightGBM |
| Lots of categorical features | CatBoost |
| Massive dataset, low memory | LightGBM |
| Time-series with lag features | XGBoost / LightGBM with rolling-origin CV |
| Need interpretability | Single tree or RF with SHAP |
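For the time-series row, rolling-origin cross-validation means every fold trains on the past and tests on the block that immediately follows. A minimal sketch with an expanding training window is below; the function name and arguments are hypothetical, and library splitters may differ in detail.

```python
def rolling_origin_splits(n_samples, n_splits, test_size):
    """Return (train_indices, test_indices) pairs where test always follows train in time."""
    splits = []
    for k in range(n_splits):
        test_end = n_samples - (n_splits - 1 - k) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for this many splits")
        train = list(range(0, test_start))        # everything before the test block
        test = list(range(test_start, test_end))  # the next block in time
        splits.append((train, test))
    return splits


splits = rolling_origin_splits(n_samples=10, n_splits=3, test_size=2)
for train, test in splits:
    print(train, test)
```

Rows must be sorted by time before splitting, and lag features should be computed so that they only use values available at prediction time.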
Feature importance
Boosting models offer three flavours:
- Gain — total impurity reduction by each feature. Most informative.
- Cover — how many samples a feature affects.
- Frequency — how often it was used in a split.
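To see how these measures can disagree, the sketch below aggregates them from a hand-written list of splits. The split records are invented for illustration, and libraries differ on details such as whether gain is summed or averaged per split.

```python
from collections import defaultdict

# Each record is one split in the ensemble: (feature, gain, samples reaching the split).
split_records = [
    ("age", 12.0, 100),
    ("income", 3.0, 60),
    ("age", 1.5, 40),
    ("income", 2.0, 25),
    ("income", 0.5, 10),
]


def importances(records):
    gain = defaultdict(float)
    cover = defaultdict(int)
    frequency = defaultdict(int)
    for feature, split_gain, n_samples in records:
        gain[feature] += split_gain       # total loss reduction from this feature
        cover[feature] += n_samples       # samples passing through its splits
        frequency[feature] += 1           # number of splits that use it
    return dict(gain), dict(cover), dict(frequency)


gain, cover, frequency = importances(split_records)
print(gain, cover, frequency)
```

Here income is used more often, but age accounts for most of the loss reduction, which is why gain is usually the more informative ranking.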
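    model.feature_importances_  # XGBoost defaults to gain; LightGBM defaults to split count

For honest importance under correlated features: SHAP values.

    import shap

    # TreeExplainer computes SHAP values for tree ensembles
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)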