The brief
Mercedes hackathon, sponsored by IE. 24 hours.
- Seven tables about used-car listings in the UK — vehicle specs, prices, locations, options, condition reports.
- 268,000 rows in the main table after the first join.
- No documentation. Column names like
transmission,engine_size,electrics,trim— and you have to guess what each one means.
The goal was predicting price, but most of the day went into cleaning.
Domain reads
Things that look like garbage but aren't, once you know cars:
engine_size = 1.6is litres, not arbitrary numbers.0.0means "electric, no combustion engine".transmissionstrings: "Auto", "Automatic", "Semi-Auto", "CVT", "Tiptronic" → all need normalising toauto / manual / semi-auto.electricsis a feature pack column ("heated seats, sat-nav"), not "electric vehicle".fuel_type"Electric" means EV."Hybrid"is partial."Plug-in"is a third thing.trimis a marketing-spec label ("Sport Edition", "AMG Line") — high-cardinality, but groupable.
None of this is in the data dictionary. All of it is obvious if you've ever looked at a car ad.
Cleaning moves
- Normalise transmission strings with a domain-informed lookup table.
- Encode
engine_size: keep numeric, but add anis_electricflag forengine_size = 0. - Collapse fuel_type into
petrol / diesel / hybrid / EV / other. - Group
trimby brand intostandard / premium / luxurytiers. - Drop
electricsfor the price model (it was a noisy multi-value field) OR explode it into individual binary feature flags. - Outlier prices — UK used-car prices follow a long right tail. Log-transform target.
Feature engineering
After cleaning:
age = current_year - manufacture_yearmileage_per_year = mileage / age(handles odd "1995 with 30k miles" cases)price_per_litre = price / engine_size(premium markers)- One-hot for
make,fuel_type, transmission tier - Target-encode
model(high cardinality) inside CV folds
The model
- Random Forest for a quick robust baseline.
- XGBoost with target-encoding of
modelfor the leaderboard submission. - Log-transformed target (
log1p(price)) to handle skew. Exponentiate at predict time.
Honest CV with GroupKFold by make to prevent leakage — otherwise the model learns "BMW always £30k" and looks falsely strong on random splits.
What I learned
- Domain knowledge is a force multiplier in data cleaning. Anyone can normalise strings; not everyone knows which strings mean the same thing.
- Compound-meaning columns (
electrics,trim) need a decision: drop, simplify, or explode. - Group-aware CV matters when a categorical feature dominates (here,
make). - Log-transform skewed targets. R² improves, residuals stop fanning out, predictions stop crashing on luxury cars.
- The hackathon judges noticed the cleaning narrative more than the model accuracy.