Evaluation vs Validation
Often confused, but they're different stages:
| | Validation | Evaluation |
|---|---|---|
| When | During training / model selection | Once, at the very end |
| On what | Validation fold / set | Held-out test set |
| Purpose | Pick hyperparameters, compare models | Honest production-quality measure |
| Frequency | Many times | Exactly once |
If you tune on the test set, your evaluation is no longer evaluation — it's just more validation.
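
A minimal sketch of keeping the two stages apart, using only the standard library. The function name `three_way_split` and the 60/20/20 proportions are illustrative assumptions, not something the notes prescribe.

```python
import random

def three_way_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve out disjoint train / validation / test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
# Tune and compare models on `val` as often as needed;
# score the final chosen model on `test` a single time.
print(len(train), len(val), len(test))
```

The point of the design is that `test` is set aside before any tuning starts, so nothing learned during model selection can leak into the final score.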
The confusion matrix
Every classification metric is born here. For binary:
| | Predicted = 1 | Predicted = 0 |
|---|---|---|
| Actual = 1 | TP | FN |
| Actual = 0 | FP | TN |
- TP — correctly said yes.
- FN — said no when it was yes. Type II error.
- FP — said yes when it was no. Type I error.
- TN — correctly said no.
Every metric is just a different ratio of these four numbers. Learn the matrix, the rest follows.
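
The four cells can be counted directly from label lists. This is a plain-Python sketch; the helper name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)  # 3 1 1 3
```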
Accuracy & when it lies
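The headline metric, and the most overused: the fraction of all predictions that are correct.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$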
It lies when:
- Classes are imbalanced. Fraud is 0.1 % of transactions → predict "no fraud" always → 99.9 % accuracy → useless.
- Mistakes have asymmetric costs. Missing a cancer is not equivalent to a false alarm. Accuracy weights them the same.
Use accuracy only when classes are roughly balanced AND mistakes cost the same on both sides.
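
The fraud example can be reproduced in a few lines. The data here is synthetic (1,000 transactions, 1 fraudulent) and the "model" is just a constant prediction.

```python
# 1,000 transactions, exactly 1 fraudulent (0.1 %).
y_true = [1] + [0] * 999
y_pred = [0] * 1000  # a "model" that always predicts "no fraud"

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: how many real frauds were caught?
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
# accuracy = 0.999, recall = 0.000
```

Accuracy looks excellent while the one thing the model exists to do, catching fraud, never happens.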
Precision & Recall
The two real questions:
- Precision = TP / (TP + FP): "When my model says yes, how often is it right?" Punishes false alarms.
- Recall = TP / (TP + FN): "Of all the actual yes's, how many did I catch?" Punishes misses.
Which one matters more? Depends on the cost of mistakes:
- High-stakes screening (cancer, fraud) → Recall. Missing a case is catastrophic.
- Costly intervention (spam filter, manual review queue) → Precision. Each false alarm wastes resources.
The trade-off: raising the decision threshold usually increases precision and lowers recall; lowering it does the opposite. There's no free lunch.
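
A small sketch of the threshold trade-off. The scores below are invented so the effect is easy to see; on real data precision usually, but not always, rises with the threshold.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting 1 for every score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up model scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]

for threshold in (0.25, 0.50, 0.75):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.25  precision=0.67  recall=1.00
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.75  precision=1.00  recall=0.50
```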
F1 — balancing P & R
Harmonic mean of precision and recall. Punishes extreme values: if either is near zero, F1 collapses.
$$F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}$$
- F1 = 1 → perfect P and R both.
- F1 = 0 → at least one of them is 0.
Use F1 when you want one number that balances precision and recall and don't have a clear preference.
F-beta lets you weight one more than the other:
$$F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}$$
β > 1 favours recall; β < 1 favours precision.
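
A sketch of F1 and F-beta computed from precision and recall, assuming the standard definition (1 + β²)·P·R / (β²·P + R). The function name `f_beta` and the example values P = 0.9, R = 0.5 are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1 (the harmonic mean of precision and recall)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denom

p, r = 0.9, 0.5  # high precision, mediocre recall
print(f"F1   = {f_beta(p, r):.3f}")
print(f"F2   = {f_beta(p, r, beta=2):.3f}")    # leans toward recall, so it drops
print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")  # leans toward precision, so it rises
```

With recall as the weaker of the two, F2 comes out below F1 and F0.5 above it, which is the weighting described above.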
MCC — the fair one
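Matthews Correlation Coefficient: among the most reliable single-number metrics under class imbalance.
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$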
- Range: −1 to +1 (1 = perfect, 0 = random, −1 = perfectly wrong).
- Uses all four cells of the confusion matrix — F1 ignores TN.
- Robust under severe imbalance: accuracy and F1 can both look good while one class is predicted poorly; MCC is high only when both classes are predicted well.
Use MCC when classes are imbalanced and you want a single trustworthy number.
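
A sketch comparing accuracy, F1 and MCC on one imbalanced confusion matrix. The counts are hypothetical, and returning 0 when the denominator vanishes is a common convention rather than part of the definition.

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a whole row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def f1(tp, fn, fp):
    """F1 from counts; note that TN does not appear anywhere."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical case: 95 positives, 5 negatives, and a model that
# says "positive" for almost everything.
tp, fn, fp, tn = 94, 1, 5, 0

accuracy = (tp + tn) / (tp + fn + fp + tn)
print(f"accuracy = {accuracy:.2f}")          # 0.94
print(f"F1       = {f1(tp, fn, fp):.2f}")    # 0.97
print(f"MCC      = {mcc(tp, fn, fp, tn):.2f}")  # -0.02
```

Accuracy and F1 both look strong here, while MCC sits near zero because none of the negatives were identified.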
Cohen's Kappa
"Better than random chance?"
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
Where p_o = observed accuracy, p_e = accuracy expected by chance.
- κ = 1 → perfect agreement.
- κ = 0 → no better than random guessing weighted by class frequencies.
- κ < 0 → worse than random.
Useful when comparing your model against a baseline guesser. Also classic in inter-annotator agreement.
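
A sketch of kappa for the binary case, assuming the usual definition κ = (p_o − p_e) / (1 − p_e) with p_e built from the actual and predicted class frequencies. The function name and the example counts are illustrative.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fn + fp + tn
    p_o = (tp + tn) / n  # observed accuracy
    # Chance agreement: for each class, P(actual = class) * P(predicted = class).
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_yes + p_no
    if p_e == 1:
        # Kappa is undefined (0/0) when only one class ever appears;
        # returning 0.0 here is an arbitrary choice for this sketch.
        return 0.0
    return (p_o - p_e) / (1 - p_e)

# A reasonably good classifier on balanced data (hypothetical counts).
print(f"{cohens_kappa(tp=40, fn=10, fp=5, tn=45):.2f}")  # 0.70

# "Always predict no" on 1 positive in 1,000: 99.9 % accuracy, kappa of zero.
print(f"{cohens_kappa(tp=0, fn=1, fp=0, tn=999):.2f}")   # 0.00
```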
Multi-class metrics
For more than 2 classes, you compute per-class metrics then average. Three ways:
| Averaging | What it does | Use when |
|---|---|---|
| Macro | Unweighted mean across classes. | All classes equally important. |
| Weighted | Mean weighted by class support. | Account for class imbalance. |
| Micro | Aggregate TP/FP/FN across all classes, then compute. | One overall number; equals accuracy for single-label multi-class problems. |
Rule of thumb: report macro-F1 if you care about every class equally (esp. minority classes). Report micro or weighted if you care about overall sample-level performance.
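
The three averages can be sketched in plain Python for F1. Helper names and the toy labels (one dominant class, two rare ones) are assumptions made for illustration.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """One-vs-rest TP / FP / FN counts for every class label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1  # false alarm for the predicted class
            fn[actual] += 1     # miss for the actual class
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {c: f1_from_counts(tp[c], fp[c], fn[c]) for c in labels}
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"macro": macro, "weighted": weighted, "micro": micro}

# Toy data: class "a" dominates; the single "b" sample is misclassified as "a".
y_true = ["a"] * 8 + ["b"] + ["c"]
y_pred = ["a"] * 8 + ["a"] + ["c"]
result = averaged_f1(y_true, y_pred)
print({name: round(value, 3) for name, value in result.items()})
# {'macro': 0.647, 'weighted': 0.853, 'micro': 0.9}
```

Macro-F1 drops sharply because the missed minority class counts as much as the dominant one, while micro-F1 matches the plain accuracy of 9 out of 10.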