Maria Aguilera

Mercedes hackathon, sponsored by IE. 24 hours.

Seven tables about used-car listings in the UK — vehicle specs, prices, locations, options, condition reports.
268,000 rows in the main table after the first join.
No documentation. Column names like transmission, engine_size, electrics, trim — and you have to guess what each one means.

The goal was predicting price, but most of the day went into cleaning.

Things that look like garbage but aren't, once you know cars:

engine_size = 1.6 is litres, not arbitrary numbers. 0.0 means "electric, no combustion engine".
transmission strings: "Auto", "Automatic", "Semi-Auto", "CVT", "Tiptronic" → all need normalising to auto / manual / semi-auto.
electrics is a feature pack column ("heated seats, sat-nav"), not "electric vehicle".
fuel_type "Electric" means EV. "Hybrid" is partial. "Plug-in" is a third thing.
trim is a marketing-spec label ("Sport Edition", "AMG Line") — high-cardinality, but groupable.

None of this is in the data dictionary. All of it is obvious if you've ever looked at a car ad.

Normalise transmission strings with a domain-informed lookup table.
Encode engine_size: keep numeric, but add an is_electric flag for engine_size = 0.
Collapse fuel_type into petrol / diesel / hybrid / EV / other.
Group trim by brand into standard / premium / luxury tiers.
Drop electrics for the price model (it was a noisy multi-value field) OR explode it into individual binary feature flags.
Outlier prices — UK used-car prices follow a long right tail. Log-transform target.

After cleaning:

Random Forest for a quick robust baseline.
XGBoost with target-encoding of model for the leaderboard submission.
Log-transformed target (log1p(price)) to handle skew. Exponentiate at predict time.

Honest CV with GroupKFold by make to prevent leakage — otherwise the model learns "BMW always £30k" and looks falsely strong on random splits.

Domain knowledge is a force multiplier in data cleaning. Anyone can normalise strings; not everyone knows which strings mean the same thing.
Compound-meaning columns (electrics, trim) need a decision: drop, simplify, or explode.
Group-aware CV matters when a categorical feature dominates (here, make).
Log-transform skewed targets. R² improves, residuals stop fanning out, predictions stop crashing on luxury cars.
The hackathon judges noticed the cleaning narrative more than the model accuracy.

Mercedes Hackathon — Cheat Sheet