| Component | Question it answers |
|---|---|
| Representation | What kind of model are we allowed to use? (linear, tree, neural net...) |
| Evaluation | How do we know if a model is good? (loss, accuracy, F1...) |
| Optimisation | How do we find the best model? (gradient descent, search...) |
What ML actually is
ML = programs that improve at a task with experience, instead of being explicitly programmed for every case.
Three components every algorithm has:
Pick all three. Most of ML engineering is choosing this triple wisely.
Generalisation > training error
The whole game.
- Underfit — model too simple, misses the pattern. High train + test error.
- Overfit — model too complex, memorises noise. Low train, high test error.
- Good fit — captures pattern, ignores noise. Low train + test error.
Symptoms of overfitting:
- Train accuracy ≫ test accuracy
- Small data shifts collapse performance
- Coefficients become huge / unstable
Fight back with: more data, simpler model, regularisation, cross-validation, early stopping.
Curse of dimensionality
In high dimensions:
- Space grows exponentially. A grid of 10 bins per feature needs
10^dcells. - All points become "far apart". Nearest-neighbour stops meaning neighbour.
- Distances concentrate. Min and max distances converge to similar values.
Three ways to fight back:
- Reduce dimensions (PCA, feature selection).
- Use models that bend to low-dimensional structure (trees, neural nets).
- Collect more data — but
nneeds to grow exponentially withd.
The blessing: real data usually lives near a low-dimensional manifold. The features may be 100, but the signal lives in 5.
Feature engineering is the key
"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? The most important factor is the features used." — Pedro Domingos
What this looks like:
- Encoding
date→weekday,month,is_weekend - Combining
heightandweight→BMI - Log-transforming a skewed price column
- Replacing a zip code with the average house price in it
Good features make the algorithm's job trivial. Bad features no algorithm can rescue.
This is covered fully in Part 3.
Data alone is not enough — GIGO
Garbage in, garbage out.
- More data beats a cleverer algorithm — but only if the data is representative.
- Biased data → biased model. No algorithm fixes that.
- Data with errors, missing values, label noise → the model learns the noise.
The trade-off:
- More data, simple model > less data, fancy model.
- Better features, simple model > raw features, fancy model.
- Domain knowledge > brute compute, almost always.
Cleaning + feature engineering set the absolute ceiling on performance. The model just decides how close you get.
Learn many models, not one
The "no free lunch" theorem: no single model is best for all problems.
Pragmatic rule of thumb:
| If you have... | Start with... |
|---|---|
| Few features, lots of data, linear-ish | Logistic / Linear Regression |
| Tabular data, mixed types | XGBoost / LightGBM / Random Forest |
| High-dim continuous, small data | SVM, regularised linear |
| Images / audio / text | Neural networks |
| Want interpretability | Decision tree, linear with few features |
Then ensemble them — averaging or stacking near-always beats the best single model. Cross-validate everything.
Types of ML systems
By supervision:
- Supervised — labels given. Regression (continuous) or classification (discrete).
- Unsupervised — no labels. Clustering, dimensionality reduction, anomaly detection.
- Semi-supervised — few labels, many unlabelled examples.
- Reinforcement — agent interacts with environment, gets reward signal.
By learning style:
- Batch / offline — train on all data, deploy frozen model.
- Online / incremental — keep learning as new data arrives.
By generalisation:
- Instance-based — memorise examples, compare new ones (KNN).
- Model-based — fit a function, throw away examples (linear, NN).
The main challenges
The five failure modes every ML engineer should be able to name:
- Insufficient data. Especially for complex models. Bias-variance trade-off.
- Non-representative data. Sampling bias → model fails on the cases that matter.
- Poor quality data. Outliers, errors, missing values, mis-labels.
- Irrelevant features. Garbage features dilute signal. Feature selection matters.
- Overfitting & underfitting. The eternal trade-off — match model complexity to data size and signal.
Bonus: label shift / distribution shift in production. The world doesn't sit still. Monitor.
