Maria Aguilera

Validation and evaluation are often confused, but they are different stages:

	Validation	Evaluation
When	During training / model selection	Once, at the very end
On what	Validation fold / set	Held-out test set
Purpose	Pick hyperparameters, compare models	Honest production-quality measure
Frequency	Many times	Exactly once

If you tune on the test set, your evaluation is no longer evaluation — it's just more validation.
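A minimal sketch of the split behind this table, assuming a plain shuffled 60/20/20 cut. The helper name `three_way_split` and the fractions are illustrative choices, not something the sheet prescribes.

```python
import random

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/validation/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    n_test = round(n_samples * test_frac)
    n_val = round(n_samples * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
# Validation: score candidate models on val_idx as many times as needed.
# Evaluation: score the final chosen model on test_idx a single time.
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

The test indices are carved off first and never touched during tuning, which is what keeps the final number honest.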

Every classification metric is born in the confusion matrix. For binary classification:

	Predicted = 1	Predicted = 0
Actual = 1	TP	FN
Actual = 0	FP	TN

TP — correctly said yes.
FN — said no when it was yes. Type II error.
FP — said yes when it was no. Type I error.
TN — correctly said no.

Every metric is just a different combination of these four numbers. Learn the matrix and the rest follows.
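The four cells can be counted directly from labels and predictions. A small sketch, assuming labels are encoded as 0/1; the function name `confusion_counts` and the toy data are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FN, FP, TN) for binary labels encoded as 0/1."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # correctly said yes
        elif actual == 1 and predicted == 0:
            fn += 1  # said no when it was yes (Type II)
        elif actual == 0 and predicted == 1:
            fp += 1  # said yes when it was no (Type I)
        else:
            tn += 1  # correctly said no
    return tp, fn, fp, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```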

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Accuracy is the headline metric, and the most overused.

It lies when:

Classes are imbalanced. Fraud is 0.1 % of transactions → predict "no fraud" always → 99.9 % accuracy → useless.
Mistakes have asymmetric costs. Missing a cancer is not equivalent to a false alarm. Accuracy weights them the same.

Use accuracy only when classes are roughly balanced AND mistakes cost the same on both sides.
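The fraud case above, worked through as a sketch: 1,000 synthetic transactions with a single fraud (0.1 %) and a "model" that always predicts "no fraud". The data is invented purely to reproduce the arithmetic.

```python
# 1 fraud among 1,000 transactions = 0.1 % positive class.
y_true = [1] + [0] * 999
y_pred = [0] * 1000  # always predict "no fraud"

accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)
frauds_caught = sum(a == 1 and p == 1 for a, p in zip(y_true, y_pred))

print(accuracy)       # 0.999
print(frauds_caught)  # 0
```

The accuracy looks excellent while not a single fraud is caught.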

Precision and recall ask the two real questions:

$\text{Precision} = \frac{TP}{TP + FP} \quad\text{Recall} = \frac{TP}{TP + FN}$

Precision — "When my model says yes, how often is it right?" Punishes false alarms.
Recall — "Of all the actual yes's, how many did I catch?" Punishes misses.

Which one matters more? Depends on the cost of mistakes:

High-stakes screening (cancer, fraud) → Recall. Missing a case is catastrophic.
Costly intervention (spam filter, manual review queue) → Precision. Each false alarm wastes resources.

The trade-off: raising the decision threshold usually ↑ precision but ↓ recall, and lowering it does the opposite. There's no free lunch.
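A sketch of the threshold trade-off on eight made-up scores. The helper `precision_recall` and the convention of returning 0.0 when a denominator is empty are assumptions made for this example.

```python
def precision_recall(y_true, scores, threshold):
    """Predict 1 when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(a == 1 and p == 1 for a, p in zip(y_true, y_pred))
    fp = sum(a == 0 and p == 1 for a, p in zip(y_true, y_pred))
    fn = sum(a == 1 and p == 0 for a, p in zip(y_true, y_pred))
    # Assumed convention: report 0.0 when there is nothing to divide by.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.2, 0.5, 0.85):
    p, r = precision_recall(y_true, scores, threshold)
    print(threshold, round(p, 3), round(r, 3))
# 0.2 0.571 1.0
# 0.5 0.75 0.75
# 0.85 1.0 0.25
```

On this toy data precision climbs from 0.571 to 1.0 as the threshold rises, while recall falls from 1.0 to 0.25.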

$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$

Harmonic mean of precision and recall. Punishes extreme values — if either is near zero, F1 collapses.

F1 = 1 → both P and R are perfect.
F1 = 0 → at least one of them is 0.

Use F1 when you want one number that balances precision and recall and you don't have a clear preference between them.

F-beta lets you weight one more than the other: $F_\beta = (1+\beta^2) \frac{PR}{\beta^2 P + R}$

β > 1 favours recall; β < 1 favours precision.
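A short sketch of the F-beta formula above, with F1 as the β = 1 case. The function name `f_beta` and the sample values P = 0.9, R = 0.5 are illustrative; returning 0.0 for an empty denominator is an assumed convention.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precision and recall; beta=1 gives F1."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when both P and R are 0
    return (1 + beta**2) * precision * recall / denom

p, r = 0.9, 0.5  # precise but misses half the positives
print(round(f_beta(p, r), 3))            # 0.643  (F1)
print(round(f_beta(p, r, beta=2), 3))    # 0.549  (leans on the weak recall)
print(round(f_beta(p, r, beta=0.5), 3))  # 0.776  (leans on the strong precision)

# The harmonic mean collapses when either side is near zero.
print(round(f_beta(1.0, 0.01), 3))       # 0.02
```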

Matthews Correlation Coefficient — the most honest single-number metric under imbalance.

$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

Range: −1 to +1 (1 = perfect, 0 = random, −1 = perfectly wrong).
Uses all four cells of the confusion matrix — F1 ignores TN.
Honest under severe imbalance: accuracy and F1 can both mislead, while MCC is much harder to fool.

Use MCC when classes are imbalanced and you want a single trustworthy number.
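A sketch comparing accuracy, F1 and MCC on an invented imbalanced confusion matrix (95 positives, 5 negatives). The function name `mcc` and the choice to return 0.0 when the denominator is zero are assumptions of this example.

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews Correlation Coefficient from the four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a whole row or column is empty
    return (tp * tn - fp * fn) / denom

# The model finds most positives but gets 4 of the 5 negatives wrong.
tp, fn, fp, tn = 90, 5, 4, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)  # same as 2PR / (P + R)

print(round(accuracy, 3))             # 0.91
print(round(f1, 3))                   # 0.952
print(round(mcc(tp, fn, fp, tn), 3))  # 0.135
```

Accuracy and F1 both look strong here because neither is hurt much by the handful of negatives; MCC sits close to 0 because the model barely separates the two classes.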

Cohen's kappa asks: "Better than random chance?"

$\kappa = \frac{p_o - p_e}{1 - p_e}$

Where p_o = observed accuracy and p_e = accuracy expected by chance, computed from the class frequencies of the true labels and of the predictions.

κ = 1 → perfect agreement.
κ = 0 → no better than random guessing weighted by class frequencies.
κ < 0 → worse than random.

Useful when comparing your model against a baseline guesser. Also classic in inter-annotator agreement.
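A sketch of the kappa formula on label lists, with p_e taken from the class frequencies of the true labels and of the predictions. The function name `cohen_kappa` and both toy datasets are illustrative; returning 0.0 when p_e = 1 is an assumed convention.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(a == p for a, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: both sides pick class c independently.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if p_e == 1:
        return 0.0  # assumed convention: only one class ever appears
    return (p_o - p_e) / (1 - p_e)

# Balanced toy data: 75 % accuracy, 50 % expected by chance.
print(cohen_kappa([1, 1, 1, 0, 0, 0, 0, 1],
                  [1, 0, 1, 0, 1, 0, 0, 1]))  # 0.5

# Always predicting the majority class: 99.9 % accuracy, kappa of zero.
print(cohen_kappa([1] + [0] * 999, [0] * 1000))  # 0.0
```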

For more than 2 classes, you compute per-class metrics then average. Three ways:

Averaging	What it does	Use when
Macro	Unweighted mean across classes.	All classes equally important.
Weighted	Mean weighted by class support.	Account for class imbalance.
Micro	Aggregate TP/FP/FN across all classes, then compute.	One overall number; equals accuracy in single-label multi-class problems.

Rule of thumb: report macro-F1 if you care about every class equally (esp. minority classes). Report micro or weighted if you care about overall sample-level performance.
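A sketch of the three averaging schemes for F1 on a made-up three-class problem with one rare class. All function names and the data are illustrative assumptions; per-class counts use one-vs-rest.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} using one-vs-rest counting."""
    counts = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(a == c and p == c for a, p in zip(y_true, y_pred))
        fp = sum(a != c and p == c for a, p in zip(y_true, y_pred))
        fn = sum(a == c and p != c for a, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), equivalent to 2PR / (P + R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)  # number of true samples per class
    f1 = {c: f1_from_counts(*counts[c]) for c in counts}
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[c] * support[c] for c in f1) / len(y_true)
    # Micro: pool the counts over all classes first, then compute once.
    micro = f1_from_counts(sum(tp for tp, _, _ in counts.values()),
                           sum(fp for _, fp, _ in counts.values()),
                           sum(fn for _, _, fn in counts.values()))
    return macro, weighted, micro

y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]

macro, weighted, micro = averaged_f1(y_true, y_pred)
accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)

print(round(macro, 3))     # 0.479
print(round(weighted, 3))  # 0.662
print(round(micro, 3))     # 0.7
print(accuracy)            # 0.7
```

The missed rare class "c" drags macro-F1 down the most, while micro-F1 matches plain accuracy because every sample has exactly one label.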

Part 4 · Classification Metrics — Cheat Sheet

Evaluation vs Validation

The confusion matrix

Accuracy & when it lies

Precision & Recall

F1 — balancing P & R

MCC — the fair one

Cohen's Kappa

Multi-class metrics