Post 5 — More Data, Better Data, and Domain Expertise

Last update: June 2026. All opinions are my own.

ML Foundations · Post 5/10

More data beats a cleverer algorithm

This is the most counter-intuitive result in the foundations. It comes from the classic Banko & Brill experiment (2001) on natural-language disambiguation:

Take four different ML algorithms. Train each on increasing amounts of data. Plot accuracy vs training-set size.

With little data, the more sophisticated algorithm wins. So far, expected. But as you keep adding data, the gap shrinks. With enough data, all four algorithms converge to roughly the same accuracy, around 95%. The fancy algorithm and the simple one perform nearly identically.
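If you want to run this kind of experiment yourself, here is a minimal sketch. It is not the Banko & Brill setup: it uses two toy learners (a nearest-centroid classifier and a k-nearest-neighbours classifier) on synthetic 2-D data, and every function name in it is made up for illustration. The point is only the shape of the loop: train each learner on growing training sets, score all of them on one shared test set, and compare the curves.

```python
import random

def make_data(n, rng):
    """Synthetic 2-D points from two overlapping Gaussian classes (toy data)."""
    data = []
    for i in range(n):
        label = i % 2                      # alternate labels so classes stay balanced
        center = 1.0 if label else -1.0
        point = (rng.gauss(center, 1.5), rng.gauss(center, 1.5))
        data.append((point, label))
    return data

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def fit_nearest_centroid(train):
    """Simple learner: predict the class whose mean point is closest."""
    centroids = {}
    for label in (0, 1):
        pts = [p for p, y in train if y == label]
        centroids[label] = (sum(p[0] for p in pts) / len(pts),
                            sum(p[1] for p in pts) / len(pts))
    return lambda p: min(centroids, key=lambda c: sq_dist(p, centroids[c]))

def fit_knn(train, k=5):
    """More flexible learner: majority vote among the k nearest training points."""
    def predict(p):
        nearest = sorted(train, key=lambda item: sq_dist(p, item[0]))[:k]
        votes = sum(y for _, y in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict

def accuracy(predict, test):
    return sum(predict(p) == y for p, y in test) / len(test)

def learning_curves(sizes, n_test=300, seed=0):
    """Accuracy of each learner at each training-set size, on one shared test set."""
    rng = random.Random(seed)
    test = make_data(n_test, rng)
    learners = {"nearest_centroid": fit_nearest_centroid, "knn": fit_knn}
    curves = {name: [] for name in learners}
    for n in sizes:
        train = make_data(n, rng)
        for name, fit in learners.items():
            curves[name].append(accuracy(fit(train), test))
    return curves

if __name__ == "__main__":
    sizes = [10, 100, 1000]
    for name, accs in learning_curves(sizes).items():
        print(name, [f"{a:.2f}" for a in accs])
```

The exact numbers depend on the seed and on how hard the toy problem is, so treat the output as a shape to look at rather than a result to quote. On real data you would swap in your own learners and a proper train/test split.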

🔑 MORE DATA BEATS A CLEVERER ALGORITHM.

If you only remember one thing about the relationship between data and algorithms: more good data almost always beats a fancier model. Weeks invested in a better algorithm rarely pay off compared to weeks invested in better data.

When data is rubbish

There's a giant asterisk on the previous point.

⚠️ Garbage In, Garbage Out (GIGO). More data only helps if the data is good. Noisy, incomplete, biased, or unrepresentative data does not improve with volume: piling on more of it tends to give you a model that is more confidently wrong, not a better one.

What "good data" looks like in practice:

No errors. Someone listed as -1 years old, or 139, is a data-entry error. A solar-power reading of 0 is trickier: at night it's a genuine value, but elsewhere it may be standing in for a missing reading. Find and fix these before the algorithm sees them.
Novel. Duplicates and near-duplicates don't add information. Discard them.
Relevant. Data from the wrong domain (training on cats, deploying on dogs) won't help.
Representative. Training on people from one country and deploying everywhere is the classic trap. Your training set has to look like the data you'll see in production.
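A few of these checks are easy to sketch in code. The example below is a rough illustration, not a complete cleaning pipeline: the field names ("age", "country"), the age cutoff, and the helper names are all assumptions of mine. It drops rows with implausible ages, removes exact duplicates (near-duplicates need fuzzier matching that isn't shown here), and measures how far the training mix drifts from the production mix for one field.

```python
from collections import Counter

MAX_PLAUSIBLE_AGE = 120  # assumed cutoff; pick one that fits your domain

def clean(records):
    """Drop rows with implausible ages and exact duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for row in records:
        age = row.get("age")
        if age is None or not (0 <= age <= MAX_PLAUSIBLE_AGE):
            continue                      # error: missing or impossible age
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key in seen:
            continue                      # not novel: exact duplicate
        seen.add(key)
        kept.append(row)
    return kept

def share_by(records, field):
    """Fraction of records taking each value of `field`."""
    counts = Counter(row[field] for row in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def largest_share_gap(train, production, field):
    """Biggest absolute difference in share for any value of `field`."""
    a, b = share_by(train, field), share_by(production, field)
    return max(abs(a.get(v, 0.0) - b.get(v, 0.0)) for v in set(a) | set(b))

if __name__ == "__main__":
    raw = [
        {"age": 34, "country": "NL"},
        {"age": -1, "country": "NL"},
        {"age": 139, "country": "DE"},
        {"age": 34, "country": "NL"},
        {"age": 51, "country": "DE"},
    ]
    print(clean(raw))
```

A large gap from `largest_share_gap` doesn't prove the training set is unrepresentative, but it's a cheap prompt to go and look.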

Domain expertise matters

The biggest signal that an ML project will work isn't the algorithm. It's the data scientist sitting down with someone who's been in the industry for 20 years and saying "so what actually matters to you?"

Those conversations produce features you'd never invent on your own. The expert knows:

Which seasonal effects matter.
Which weird outliers are real signal vs data-entry errors.
Which target variable is the one the business actually cares about — versus the one you've been computing.
Which costs are asymmetric (false negatives worse than false positives, or the reverse).
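The last point, asymmetric costs, translates into code quite directly. Here is a small sketch, assuming you already have model scores between 0 and 1 and true labels for a validation set, and that the expert has given you a rough cost for each kind of mistake. The function names and the example costs are hypothetical. Instead of defaulting to a 0.5 cutoff, it picks the decision threshold that minimises total cost.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Sum the cost of every mistake made when predicting positive at score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 0:
            cost += cost_fp               # false positive
        elif not predicted_positive and label == 1:
            cost += cost_fn               # false negative
    return cost

def best_threshold(scores, labels, cost_fp, cost_fn):
    """Try each observed score (plus 'never predict positive') as the cutoff."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))

if __name__ == "__main__":
    scores = [0.1, 0.4, 0.6, 0.9]
    labels = [0, 1, 0, 1]
    # Missing a positive is 10x worse than a false alarm: the cutoff moves down.
    print(best_threshold(scores, labels, cost_fp=1, cost_fn=10))   # 0.4
    # False alarms are 10x worse: the cutoff moves up.
    print(best_threshold(scores, labels, cost_fp=10, cost_fn=1))   # 0.9
```

The same model ends up with two different operating points depending on which mistake hurts more, which is exactly the kind of thing only the domain expert can tell you.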

Whenever I work on a new problem, the first thing I do is ask the domain expert what they would want to know. Then I figure out how to engineer features that answer their question. It works better than starting from the data.

⭐ The right data + domain knowledge + ML = the best results.

Next up — Post 6: Learn Many Models, Not Just One.
