Part 1 · What is Machine Learning? — Cheat Sheet

The foundations: the decisions, definitions, and pitfalls that determine whether a model generalises or memorises.

1

What ML actually is

ML = programs that improve at a task with experience, instead of being explicitly programmed for every case.

Three components every algorithm has:

Component        Question it answers
Representation   What kind of model are we allowed to use? (linear, tree, neural net...)
Evaluation       How do we know if a model is good? (loss, accuracy, F1...)
Optimisation     How do we find the best model? (gradient descent, search...)

Pick all three. Most of ML engineering is choosing this triple wisely.
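
A minimal sketch of the triple on a toy problem, assuming made-up data from y = 3x + 2 with a little noise. The function names (predict, mse, fit) and the learning rate are illustrative choices, not a prescribed recipe.

```python
import random

# Hypothetical toy data: y = 3x + 2 plus small noise (assumed for illustration).
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [3 * x + 2 + random.gauss(0, 0.1) for x in xs]

# Representation: the family of models we allow, here a straight line.
def predict(w, b, x):
    return w * x + b

# Evaluation: how we score a candidate model, here mean squared error.
def mse(w, b):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Optimisation: how we search the family, here plain gradient descent.
def fit(lr=0.1, steps=500):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit()
print(f"w={w:.2f}, b={b:.2f}, mse={mse(w, b):.4f}")
```

Swapping any one piece (a tree instead of a line, absolute error instead of squared error, random search instead of gradient descent) gives a different algorithm.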

2

Generalisation > training error

The whole game.

  • Underfit — model too simple, misses the pattern. High train + test error.
  • Overfit — model too complex, memorises noise. Low train, high test error.
  • Good fit — captures pattern, ignores noise. Low train + test error.

Symptoms of overfitting:

  • Train accuracy ≫ test accuracy
  • Small data shifts collapse performance
  • Coefficients become huge / unstable

Fight back with: more data, simpler model, regularisation, cross-validation, early stopping.
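
A small sketch of the train/test gap, assuming a made-up noisy sine target and a hand-rolled k-nearest-neighbour regressor. With k=1 the model memorises every training point, so its train error is zero while its test error is not; a larger k smooths over the noise.

```python
import math
import random

random.seed(1)

def make_data(n):
    # Hypothetical noisy target: sin(x) plus Gaussian noise.
    xs = [random.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]

train, test = make_data(200), make_data(200)

def knn_predict(x, k):
    # Average the targets of the k training points closest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, k):
    return sum((knn_predict(x, k) - y) ** 2 for x, y in data) / len(data)

for k in (1, 10):
    print(f"k={k:2d}  train MSE={mse(train, k):.3f}  test MSE={mse(test, k):.3f}")
```

A train error far below the test error is the overfitting symptom listed above; k here plays the role of the "simpler model" knob.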

3

Curse of dimensionality

In high dimensions:

  • Space grows exponentially. A grid of 10 bins per feature needs 10^d cells.
  • All points become "far apart". Nearest-neighbour stops meaning neighbour.
  • Distances concentrate. The nearest and farthest distances from a point converge, so their ratio approaches 1.
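
A quick sketch of distance concentration, assuming points drawn uniformly from the unit cube. The sample size of 200 and the dimensions tried are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)

def min_max_ratio(d, n=200):
    # Sample one query and n points uniformly from the unit cube [0, 1]^d.
    query = [random.random() for _ in range(d)]
    dists = [
        math.dist(query, [random.random() for _ in range(d)])
        for _ in range(n)
    ]
    # A ratio near 1 means the nearest point is almost as far as the farthest.
    return min(dists) / max(dists)

for d in (2, 10, 100, 1000):
    print(f"d={d:4d}  grid cells=10^{d}  min/max distance={min_max_ratio(d):.2f}")
```

The ratio climbs towards 1 as d grows, which is why "nearest" carries less and less information.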

Three ways to fight back:

  1. Reduce dimensions (PCA, feature selection).
  2. Use models that bend to low-dimensional structure (trees, neural nets).
  3. Collect more data, but to keep the same sampling density n must grow exponentially with d.

The blessing: real data usually lives near a low-dimensional manifold. The features may be 100, but the signal lives in 5.

4

Feature engineering is the key

"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? The most important factor is the features used." — Pedro Domingos

What this looks like:

  • Encoding date → weekday, month, is_weekend
  • Combining height and weight → BMI
  • Log-transforming a skewed price column
  • Replacing a zip code with the average house price in it
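
The four transformations above, sketched on a single made-up record. All field names and values are assumptions for illustration only.

```python
import math
from datetime import date
from statistics import mean

# Hypothetical raw record; every field name here is an assumption.
row = {"date": date(2024, 3, 16), "height_m": 1.75, "weight_kg": 70.0,
       "price": 250_000.0, "zip": "90210"}

# Hypothetical training prices per zip code, used for the zip-code encoding.
train_prices = {"90210": [900_000.0, 1_100_000.0],
                "10001": [600_000.0, 800_000.0]}

features = {
    # Date -> calendar parts a model can use directly.
    "weekday": row["date"].weekday(),        # Monday = 0 ... Sunday = 6
    "month": row["date"].month,
    "is_weekend": row["date"].weekday() >= 5,
    # Height and weight -> body mass index.
    "bmi": row["weight_kg"] / row["height_m"] ** 2,
    # Skewed price -> log scale (log1p also handles a price of 0).
    "log_price": math.log1p(row["price"]),
    # Zip code -> mean house price seen in the training data for that zip.
    "zip_mean_price": mean(train_prices[row["zip"]]),
}
print(features)
```

One caution on the last feature: compute the per-zip average from training rows only, otherwise the target leaks into the input.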

Good features make the algorithm's job trivial. Bad features no algorithm can rescue.

This is covered fully in Part 3.

5

Data alone is not enough — GIGO

Garbage in, garbage out.

  • More data beats a cleverer algorithm — but only if the data is representative.
  • Biased data → biased model. No algorithm fixes that.
  • Data with errors, missing values, label noise → the model learns the noise.

The trade-off:

  • More data, simple model > less data, fancy model.
  • Better features, simple model > raw features, fancy model.
  • Domain knowledge > brute compute, almost always.

Cleaning + feature engineering set the absolute ceiling on performance. The model just decides how close you get.

6

Learn many models, not one

The "no free lunch" theorem: averaged over all possible problems, no single model beats every other. Which one wins depends on the data.

Pragmatic rule of thumb:

If you have...                            Start with...
Few features, lots of data, linear-ish    Logistic / Linear Regression
Tabular data, mixed types                 XGBoost / LightGBM / Random Forest
High-dim continuous, small data           SVM, regularised linear
Images / audio / text                     Neural networks
Want interpretability                     Decision tree, linear with few features

Then ensemble them: averaging or stacking often beats the best single model. Cross-validate everything.
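
A bare-bones sketch of k-fold cross-validation, assuming the caller supplies a fit function and a score function. The names k_fold_indices and cross_val_score and the mean-predicting baseline are hypothetical; in practice a library routine would normally do this.

```python
import random

def k_fold_indices(n, k, seed=0):
    # Shuffle indices once, then deal them into k nearly equal folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

def cross_val_score(fit, score, xs, ys, k=5):
    # fit(xs, ys) -> model and score(model, xs, ys) -> float are caller-supplied.
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / k

# Usage with a hypothetical baseline that always predicts the training mean.
def fit_mean(X, Y):
    return sum(Y) / len(Y)

def mse(model, X, Y):
    return sum((model - y) ** 2 for y in Y) / len(Y)

xs = list(range(20))
ys = [2.0 * x for x in xs]
print(cross_val_score(fit_mean, mse, xs, ys))
```

Every example is used for validation exactly once, so the averaged score is a fairer basis for comparing the candidate models in the table than a single train/test split.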

7

Types of ML systems

By supervision:

  • Supervised — labels given. Regression (continuous) or classification (discrete).
  • Unsupervised — no labels. Clustering, dimensionality reduction, anomaly detection.
  • Semi-supervised — few labels, many unlabelled examples.
  • Reinforcement — agent interacts with environment, gets reward signal.

By learning style:

  • Batch / offline — train on all data, deploy frozen model.
  • Online / incremental — keep learning as new data arrives.
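
A tiny sketch of the online style, assuming a 1-D linear model updated by one stochastic-gradient step per incoming example. The stream, the learning rate, and the name sgd_update are made up for illustration.

```python
def sgd_update(w, b, x, y, lr=0.05):
    # One online step for a 1-D linear model: nudge (w, b) against the
    # gradient of the squared error on the single newest example.
    err = (w * x + b) - y
    return w - lr * 2 * err * x, b - lr * 2 * err

# Hypothetical stream: examples from y = 2x + 1 arriving one at a time.
w, b = 0.0, 0.0
for step in range(2000):
    x = (step % 20) / 10 - 1      # cycles through -1.0 ... 0.9
    w, b = sgd_update(w, b, x, 2 * x + 1)
print(f"w={w:.2f}, b={b:.2f}")
```

A batch learner would instead fit once on the full dataset and ship the resulting (w, b) unchanged.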

By generalisation:

  • Instance-based — memorise examples, compare new ones (KNN).
  • Model-based — fit a function, throw away examples (linear, NN).
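
The contrast in a few lines, assuming a made-up dataset from y = 2x + 1. The helper names knn_predict and fit_line are hypothetical.

```python
# Hypothetical training set drawn from y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]

def knn_predict(x, k=3):
    # Instance-based: keep every example and average the k nearest targets.
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def fit_line(points):
    # Model-based: compress the examples into two numbers via least squares.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

slope, intercept = fit_line(data)
print(knn_predict(4.2))             # needs `data` at prediction time
print(slope * 4.2 + intercept)      # needs only slope and intercept
```

The difference shows most clearly outside the training range: knn_predict(100) can only return an average of targets it has already seen, while the fitted line extrapolates.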

8

The main challenges

The five failure modes every ML engineer should be able to name:

  1. Insufficient data. Complex models need more examples; with too few, variance dominates (bias-variance trade-off).
  2. Non-representative data. Sampling bias → model fails on the cases that matter.
  3. Poor quality data. Outliers, errors, missing values, mis-labels.
  4. Irrelevant features. Garbage features dilute signal. Feature selection matters.
  5. Overfitting & underfitting. The eternal trade-off — match model complexity to data size and signal.

Bonus: distribution shift (including label shift) in production. The world doesn't sit still, so monitor inputs and predictions after deployment.
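
One simple way to monitor a numeric input feature, sketched under assumptions: compare the live mean with the training mean in units of standard error. The function name mean_shift_z, the sample values, and the alert threshold of 3 are all hypothetical, and real monitoring usually tracks more than the mean.

```python
from statistics import mean, stdev

def mean_shift_z(train_values, live_values):
    # Distance between the live mean and the training mean, measured in
    # standard errors of the live sample mean (a simple z-score heuristic).
    se = stdev(train_values) / len(live_values) ** 0.5
    return abs(mean(live_values) - mean(train_values)) / se

# Hypothetical feature values; the threshold of 3 is an assumed convention.
train_ages = [30, 32, 35, 31, 29, 33, 34, 30, 28, 36]
live_ages = [41, 39, 44, 40, 42, 38, 43, 45]

z = mean_shift_z(train_ages, live_ages)
print(f"z={z:.1f}", "ALERT: input drift" if z > 3 else "ok")
```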