Part 8 · Random Forest & Boosting — Cheat Sheet

Bagging vs boosting, why Random Forest wins out of the box, and how XGBoost / LightGBM / CatBoost differ under the hood.

1. Bagging vs Boosting

Aspect         | Bagging                     | Boosting
Direction      | Parallel: trees independent | Sequential: each tree fixes prior errors
Goal           | Reduce variance             | Reduce bias
Aggregation    | Average / vote              | Weighted sum
Example        | Random Forest               | XGBoost, LightGBM, CatBoost
Overfit risk   | Low                         | Higher: needs early stopping
Out-of-the-box | Almost always works         | Needs tuning

2. Random Forest

The recipe:

  1. Sample N rows with replacement (bootstrap) from training data — different sample per tree.
  2. At each split, only consider a random subset of features (typically sqrt(d) for classification).
  3. Grow deep trees with no pruning.
  4. Average predictions across all trees (for classification: majority vote or averaged class probabilities).

Why it works:

  • Each tree overfits differently. Averaging cancels the noise.
  • Feature subsetting de-correlates trees → averaging variance reduction works better.
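
The de-correlation argument can be made concrete with a standard identity. It is stated here under a simplifying assumption that is not in the text above: all B trees have the same prediction variance σ² and the same pairwise correlation ρ (B, σ and ρ are symbols introduced only for this illustration).

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding trees shrinks only the second term. Lowering ρ, which is what feature subsetting aims at, shrinks the first term, the floor that more trees cannot remove.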

Knobs:

  • n_estimators (more = smoother, slower)
  • max_features (lower = more diversity)
  • max_depth (rarely worth limiting)
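
As a rough illustration of the recipe and knobs above, here is a minimal scikit-learn sketch. The synthetic dataset and the specific values (300 trees, 20 features) are assumptions chosen for the example, and oob_score is an extra not mentioned above: it scores each row using only the trees whose bootstrap sample left that row out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=300,     # more trees: smoother estimate, slower fit
    max_features="sqrt",  # features considered at each split
    max_depth=None,       # grow trees fully, no pruning
    bootstrap=True,       # sample rows with replacement for each tree
    oob_score=True,       # evaluate on rows each tree did not see
    random_state=0,
)
rf.fit(X, y)

oob_accuracy = rf.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

The out-of-bag score gives a quick generalisation estimate without holding out a separate validation set, which is one reason Random Forest is convenient as a first baseline.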

3. Gradient Boosting

The recipe:

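  1. Start with a constant prediction (the mean target, for squared-error loss).
  2. Compute the residuals: target minus current prediction for squared error; in general, the negative gradient of the loss.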
  3. Fit a small tree to predict the residuals.
  4. Add this tree's prediction (× learning rate) to the running prediction.
  5. Repeat.

Each tree corrects what the previous ones got wrong.
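
The five steps can be sketched from scratch in plain Python. This is a toy under stated assumptions, not how any of the libraries implement it: squared-error loss, a single numeric feature, and depth-1 trees (stumps). The helper names fit_stump, mse and boost are made up for the example.

```python
import random

def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                              # step 1: constant prediction
    preds = [base] * len(ys)
    trees, history = [], [mse(ys, preds)]
    for _ in range(n_rounds):                             # step 5: repeat
        residuals = [y - p for y, p in zip(ys, preds)]    # step 2: residuals
        tree = fit_stump(xs, residuals)                   # step 3: small tree on residuals
        preds = [p + learning_rate * tree(x)              # step 4: shrunken update
                 for p, x in zip(preds, xs)]
        trees.append(tree)
        history.append(mse(ys, preds))

    def predict(x):
        return base + learning_rate * sum(t(x) for t in trees)

    return predict, history

random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [x ** 2 / 10 + random.gauss(0, 0.3) for x in xs]
predict, history = boost(xs, ys)
print(f"train MSE: {history[0]:.3f} -> {history[-1]:.3f}")
```

The training error recorded in history never increases from one round to the next, which is why training loss alone cannot tell you when to stop.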

Critical knobs:

Knob             | Effect
learning_rate    | Smaller = more trees needed, more stable. 0.05–0.1 typical.
n_estimators     | Stop early: use early_stopping_rounds on a validation set.
max_depth        | 4–8 is typical. Shallower than RF.
subsample        | Row subsampling per tree (stochastic boosting). Adds randomness, which reduces overfitting.
colsample_bytree | Column subsampling per tree, like Random Forest's feature subsetting.

4. XGBoost vs LightGBM vs CatBoost

Aspect            | XGBoost                    | LightGBM                                                    | CatBoost
Tree growth       | Level-wise (breadth-first) | Leaf-wise (best-first: split the leaf with the largest loss reduction) | Symmetric / oblivious
Speed             | Solid                      | Fastest on large data                                       | Slower train, fast inference
Memory            | OK                         | Smallest footprint                                          | OK
Categoricals      | Basic                      | Indexed                                                     | Native ordered target encoding
Sample efficiency | High                       | High                                                        | Best
Overfit risk      | Lowest                     | Highest (leaf-wise)                                         | Low
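
To make the tree-growth row more tangible, the toy below compares the order in which nodes get split under a fixed budget of three splits. The gain numbers are made up, and both functions are simplified illustrations of the two policies rather than the libraries' actual code.

```python
import heapq

# Hypothetical loss reductions ("gains") for splitting each node. Children are
# named by appending "L" / "R". The values are invented for illustration.
GAIN = {
    "root": 10.0,
    "rootL": 6.0, "rootR": 1.0,
    "rootLL": 5.0, "rootLR": 0.5, "rootRL": 0.4, "rootRR": 0.3,
}

def children(node):
    return [node + "L", node + "R"]

def level_wise(max_splits):
    """Split nodes breadth-first, one level at a time."""
    order, frontier = [], ["root"]
    while frontier and len(order) < max_splits:
        next_frontier = []
        for node in frontier:
            if node in GAIN and len(order) < max_splits:
                order.append(node)
                next_frontier.extend(children(node))
        frontier = next_frontier
    return order

def leaf_wise(max_splits):
    """Always split the current leaf with the largest gain (best-first)."""
    order, heap = [], [(-GAIN["root"], "root")]
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in children(node):
            if child in GAIN:
                heapq.heappush(heap, (-GAIN[child], child))
    return order

level_order = level_wise(3)
leaf_order = leaf_wise(3)
print("level-wise:", level_order, sum(GAIN[n] for n in level_order))
print("leaf-wise: ", leaf_order, sum(GAIN[n] for n in leaf_order))
```

With the same budget, the leaf-wise policy collects more total gain by going deeper on one side. That is the source of both its efficiency and its higher overfit risk on small datasets.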

Default picks:

  • Big data, tight memory → LightGBM
  • Lots of categoricals → CatBoost
  • Maximum control & regularisation → XGBoost

5. Missing values & categoricals

Missing values:

  • XGBoost, LightGBM: learn the best default direction at each split for missing values. No imputation needed.
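  • CatBoost: also needs no imputation for numeric features, but uses a fixed rule rather than a learned direction: by default missing values are treated as smaller than every observed value (nan_mode="Min").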
  • scikit-learn HistGradientBoosting: native support too.
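
A learned default direction can be sketched as follows: for a given split, try sending the missing rows left, then right, and keep whichever gives the lower loss. This toy uses the sum of squared errors as a stand-in for the gradient-based gain the libraries compute, and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_default_direction(xs, ys, threshold):
    """For a fixed threshold, decide whether missing rows go left or right."""
    left = [y for x, y in zip(xs, ys) if x is not None and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x > threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right

# Rows with a missing feature value (None) have targets that look like the right-hand group.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]
direction, loss = best_default_direction(xs, ys, threshold=5.0)
print(direction, round(loss, 3))
```

Because the direction is chosen per split, missingness itself can carry signal: here the rows with missing values are routed to the side whose targets they resemble.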

Categoricals:

  • CatBoost: native ordered target encoding — handles high-cardinality without leakage.
  • LightGBM: histogram-based categorical splits, pass categorical_feature=[col_indices].
  • XGBoost: recent versions support categoricals natively (enable_categorical=True with category-dtype columns); older code needed one-hot encoding.
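
The idea behind ordered target encoding can be shown in a few lines: each row's category is encoded from the targets of earlier rows only, so a row never sees its own label. This is a simplified sketch. It walks the rows in the order given, whereas CatBoost uses random permutations, and the prior value of 0.5 is an arbitrary choice for the example.

```python
def ordered_target_encode(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of rows that came before it."""
    sums, counts, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of earlier targets for this category.
        encoded.append((seen_sum + prior * prior_weight) / (seen_count + prior_weight))
        # Update running statistics only after encoding the current row.
        sums[cat] = seen_sum + target
        counts[cat] = seen_count + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
labels = [1, 0, 1, 0, 1]
encoded = ordered_target_encode(cats, labels)
print(encoded)
```

The first occurrence of every category falls back to the prior, which is what keeps rare, high-cardinality values from leaking their own target into the feature.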

6. Overfit & early stopping

Boosting will happily memorise the training data if you let it. Two safeguards:

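  1. Early stopping: track validation loss and stop when it stops improving.
# XGBoost 1.6+: set early_stopping_rounds on the estimator (fit() argument removed in 2.0).
# LightGBM 4.0+: pass callbacks=[lightgbm.early_stopping(50)] to fit() instead.
# CatBoost and older XGBoost / LightGBM accept early_stopping_rounds in fit().
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=2000, learning_rate=0.05, early_stopping_rounds=50)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],   # validation data monitored after each round
)
  2. Regularisation: min_child_samples (LightGBM; min_child_weight in XGBoost), reg_alpha (L1), reg_lambda (L2), max_depth.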

Rule: lower learning_rate + more rounds + early stopping ≫ higher learning_rate + fixed rounds.
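
The stopping logic itself is library-independent and small enough to sketch directly. The function name and the validation curve below are made up for illustration; patience plays the role of early_stopping_rounds.

```python
def best_iteration(val_losses, patience):
    """Return (best_round, stopped_round) for a sequence of validation losses."""
    best_loss, best_round = float("inf"), -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, round_idx
    return best_round, len(val_losses) - 1

# Invented validation curve: improves, then starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.55, 0.60]
best_round, stopped_round = best_iteration(losses, patience=3)
print(best_round, stopped_round)
```

Training runs a few rounds past the optimum before stopping, so the model used for prediction should be the one from best_round, not the last one fitted.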

7. When to choose each

Situation                      | Pick
Robust default, minimal tuning | Random Forest
Need probability calibration   | RF as a starting point; neither RF nor boosting is guaranteed to be calibrated, so check a reliability curve and calibrate if needed
Want max leaderboard score     | XGBoost or LightGBM
Lots of categorical features   | CatBoost
Massive dataset, low memory    | LightGBM
Time-series with lag features  | XGBoost / LightGBM with rolling-origin CV
Need interpretability          | Single tree or RF with SHAP
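
Rolling-origin CV, mentioned in the time-series row, means every fold trains on the past and tests on the block that follows it. Here is a minimal index generator, assuming an expanding training window and rows already sorted by time. The function name is hypothetical, and scikit-learn's TimeSeriesSplit offers a library version of the same idea.

```python
def rolling_origin_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) with training data strictly before test data."""
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        train = list(range(0, test_start))                      # everything before the test block
        test = list(range(test_start, test_start + test_size))  # the next block in time
        yield train, test

splits = list(rolling_origin_splits(n_samples=10, n_folds=3, test_size=2))
for train, test in splits:
    print(f"train up to {train[-1]}, test {test}")
```

Shuffled k-fold would let lag features computed from future rows leak into training, which is why the split has to respect time order.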

8. Feature importance

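Boosting models offer three flavours:

  • Gain: loss reduction contributed by splits on each feature. Most informative.
  • Cover: how many samples pass through splits on a feature.
  • Frequency: how often a feature was used in a split (called "weight" in XGBoost, "split" in LightGBM).
model.feature_importances_   # XGBoost default = gain; LightGBM default = split count (set importance_type="gain")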
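
The three flavours are different aggregations over the same set of splits, and they can rank features differently. The split records below are invented for illustration; real libraries compute these from the fitted trees.

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain, rows_reaching_node).
splits = [
    ("age", 12.0, 1000),
    ("income", 30.0, 600),
    ("age", 3.0, 400),
    ("age", 1.0, 150),
    ("city", 8.0, 250),
]

gain = defaultdict(float)
cover = defaultdict(float)
frequency = defaultdict(int)
for feature, split_gain, n_rows in splits:
    gain[feature] += split_gain   # total loss reduction
    cover[feature] += n_rows      # total rows affected
    frequency[feature] += 1       # number of splits using the feature

def ranking(scores):
    """Features sorted from most to least important."""
    return sorted(scores, key=scores.get, reverse=True)

print("gain:     ", ranking(gain))
print("cover:    ", ranking(cover))
print("frequency:", ranking(frequency))
```

Here one strong split makes income the top feature by gain, while age wins on cover and frequency because it is split on often. That kind of disagreement is a reason to check which importance type a library reports.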

For honest importance under correlated features: SHAP values.

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)