Part 7 · Decision Trees — Cheat Sheet

How trees actually split, why Gini vs entropy barely matters, how depth controls overfitting, and the interpretability trade-off.

1

How a split is chosen

At every node, the tree:

  1. Considers each feature.
  2. For each feature, tries possible split points (thresholds for numeric, group splits for categorical).
  3. Scores each candidate by how much impurity it removes.
  4. Picks the split with the largest impurity reduction.

The tree then repeats this recursively on each child node until a stopping rule fires (max depth, min samples, no impurity gain).
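The four steps above can be sketched in pure Python for numeric features. This is a minimal illustration, not scikit-learn's implementation: the helper names (gini, best_split), the midpoint thresholds, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, impurity_reduction) of the best split.

    rows   : list of equal-length lists of numeric feature values
    labels : list of class labels, one per row
    """
    n = len(rows)
    parent = gini(labels)
    best = (None, None, 0.0)
    for j in range(len(rows[0])):                      # 1. consider each feature
        values = sorted(set(row[j] for row in rows))
        # 2. candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            # 3. score: parent impurity minus size-weighted child impurity
            children = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - children
            if gain > best[2]:                         # 4. keep the largest reduction
                best = (j, t, gain)
    return best

# Hypothetical toy data: columns are [age, blood_pressure]
rows = [[25, 120], [30, 150], [55, 130], [60, 160]]
labels = ["low", "low", "high", "high"]
feature, threshold, gain = best_split(rows, labels)
print(feature, threshold, gain)
```

On this toy data the search picks feature 0 (age) at threshold 42.5, which removes all 0.5 of the parent node's Gini impurity. A real tree would then call the same search again on each child.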

2

Impurity measures

For classification (smaller = purer node):

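  • Gini impurity: 1 − Σ p², where p is each class's share of the node's samples. The probability of misclassifying a random sample from the node if you label it at random according to the node's class distribution.
  • Entropy: −Σ p · log(p). Information-theoretic measure of disorder.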

Truth: Gini and entropy almost always pick the same splits. Gini is slightly faster; entropy slightly favours balanced splits. Don't lose sleep over it.
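A quick way to see why the two measures rarely disagree is to evaluate both on a few class distributions. The sketch below assumes a base-2 logarithm for entropy (another base only rescales the values); the function names are illustrative.

```python
import math

def gini_impurity(p):
    """1 - sum(p_k^2) for a list of class proportions p."""
    return 1.0 - sum(pk ** 2 for pk in p)

def entropy(p):
    """-sum(p_k * log2(p_k)); terms with p_k = 0 are skipped (they contribute 0)."""
    return sum(-pk * math.log2(pk) for pk in p if pk > 0)

# From a pure node to a perfectly mixed two-class node
for p in ([1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [0.5, 0.5]):
    print(p, round(gini_impurity(p), 3), round(entropy(p), 3))
```

Both measures are 0 for a pure node and peak at the 50/50 node (Gini 0.5, base-2 entropy 1.0), rising in the same order in between. That shared shape is why they tend to rank candidate splits the same way.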

For regression:

  • MSE — mean squared error within the node. Splits minimise variance.
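As a small worked example of "splits minimise variance", the sketch below scores one candidate split of a regression node. The target values and the split position are made up for illustration.

```python
def mse(values):
    """Mean squared error around the node mean (i.e. the node's variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical targets in a node, already sorted by the feature being split
y = [10.0, 12.0, 30.0, 32.0]
left, right = y[:2], y[2:]

parent = mse(y)
children = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
reduction = parent - children
print(parent, children, reduction)
```

Here the parent MSE is 101.0 and the size-weighted child MSE is 1.0, so this split removes 100.0 of the node's variance. The tree picks whichever candidate gives the largest such reduction.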

3

Why trees overfit

A decision tree with no depth limit will keep splitting until every leaf is pure, down to one sample per leaf if needed. Train accuracy = 100 % (unless identical rows carry different labels). Test accuracy = wherever the noise sends it.

Knobs to control overfitting:

Hyperparameter | Effect
max_depth | Hard cap on tree depth. Smaller = more bias, less variance.
min_samples_split | A node needs ≥ N samples to be considered for splitting.
min_samples_leaf | Each leaf must have ≥ N samples. Strong regulariser.
max_features | Only consider a random subset of features at each split. The added randomness decorrelates trees in bagged ensembles.
ccp_alpha | Cost-complexity pruning. Penalty per leaf; larger values prune more.
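To see the effect of one of these knobs, the sketch below fits an unconstrained tree and a max_depth=3 tree on the same noisy synthetic data. The dataset parameters and the depth of 3 are arbitrary choices for illustration, and the exact test scores will depend on them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 % of labels flipped, so there is noise to memorise
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, clf in [("unlimited", deep), ("max_depth=3", shallow)]:
    print(name,
          "depth:", clf.get_depth(),
          "leaves:", clf.get_n_leaves(),
          "train acc:", clf.score(X_train, y_train),
          "test acc:", clf.score(X_test, y_test))
```

The unconstrained tree reaches a train accuracy of 1.0 by growing many more leaves. Comparing its test accuracy with the shallow tree's shows how much of that fit was noise.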

4

Categorical handling

scikit-learn's DecisionTreeClassifier supports numeric splits only, so you have to encode categoricals first.

Encoding | Tree behaviour
Ordinal | Tree splits at a numeric threshold, which only makes sense if the category order is meaningful.
One-hot | Tree asks "is category X" or "not". Each category becomes one binary feature.
Target / mean | Tree splits at a threshold on the per-category mean target. Works well, but watch for target leakage.

LightGBM and CatBoost can handle categoricals natively without these workarounds — see Part 8.
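For the scikit-learn route, the ordinal and one-hot encodings can be sketched with the encoders in sklearn.preprocessing. The colour column is a made-up example feature.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column with three categories
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordinal: one numeric column; by default categories are sorted, so
# blue=0, green=1, red=2 (an arbitrary alphabetical order here)
ordinal = OrdinalEncoder().fit_transform(colours)

# One-hot: one binary column per category, in the same sorted order
onehot = OneHotEncoder().fit_transform(colours).toarray()

print(ordinal.ravel())
print(onehot)
```

With the ordinal column a single threshold can separate blue from the rest or red from the rest, but it can never isolate green in one split. With the one-hot columns each category can be isolated directly.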

5

What trees don't need

The list of things trees don't care about:

  • Scaling. Splits use thresholds; multiplying a feature by 1000 doesn't change which rows are above or below.
  • Outliers in features. Splits depend on the ordering of values, not their magnitude, so a single extreme value only affects its own leaf.
  • Feature distributions. No Gaussian assumption.
  • Correlated features. Trees just pick one of the correlated pair.

This makes trees a fast first choice on messy tabular data — minimal preprocessing.
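The scaling claim is easy to check empirically. The sketch below rescales one feature of the iris dataset by 1000 and compares predictions from two otherwise identical trees; the dataset, the depth, and the choice of column are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Multiply the first feature by 1000; row ordering along that feature is unchanged
X_scaled = X.copy()
X_scaled[:, 0] *= 1000

original = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rescaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

same_predictions = np.array_equal(original.predict(X), rescaled.predict(X_scaled))
print(same_predictions)
```

The thresholds stored in the rescaled tree are larger, but they partition the rows the same way, so the predictions match.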

6

Interpretability

A small tree (depth ≤ 4) is among the most interpretable models in ML:

if age > 50:
    if blood_pressure > 140:
        → "high risk"
    else:
        → "low risk"
else:
    → "low risk"

A doctor can read it and audit it. A deep tree loses this — by depth 20, the rules are gibberish to humans.

The trade-off: shallow trees are interpretable but biased. Use a small tree for explanation, a forest / boosting for performance.
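A rule listing like the one above can be printed from a fitted scikit-learn tree with export_text. This sketch uses the iris dataset and max_depth=2 purely as an example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Plain-text if/else view of the fitted tree, one line per node
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

At depth 2 the listing fits on a handful of lines. Raising max_depth makes the printout grow quickly, which is the interpretability cost described above.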

7

Feature importance

Trees can rank features by how much impurity they remove across all splits:

clf.feature_importances_

Caveats:

  • Biased toward high-cardinality features (more thresholds to try).
  • Splits credit only one of a group of correlated features, so the others look useless.
  • Not the same as causal importance.

For more honest importance: permutation importance (sklearn.inspection.permutation_importance), ideally computed on held-out data.
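The two kinds of importance can be compared side by side. The sketch below is one possible setup: the breast cancer dataset, max_depth=4, and n_repeats=10 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Impurity-based importance: computed from the training splits, sums to 1
impurity_based = clf.feature_importances_

# Permutation importance: drop in held-out score when one feature is shuffled
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
permutation_based = result.importances_mean

# Show the five features ranked highest by impurity, with both scores
for i in impurity_based.argsort()[::-1][:5]:
    print(i, round(impurity_based[i], 3), round(permutation_based[i], 3))
```

Features that score high on impurity but near zero on permutation are worth a second look, since the tree may have split on them without gaining held-out accuracy.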

8

When to use a single tree

Mostly: don't. Single trees overfit too easily and lose to forests / boosting in pure performance.

Exceptions:

  • You need to explain the model to a stakeholder.
  • Inference must be < 1 ms on a constrained device.
  • You want a quick visualisation of which features split where.
  • You want a baseline before ensembles, to see what depth 2 already gets you.

For real prediction tasks → jump straight to Random Forest / XGBoost / LightGBM (Part 8).