Spaceship Titanic — Cheat Sheet

When missing values aren't random: reading the PassengerId and Cabin schemas to impute relationally, then running a 10-model bake-off on the Kaggle classification task.

Updated June 2026
1

The task

Kaggle binary classification: predict whether a passenger was transported to another dimension when the Spaceship Titanic collided with a spacetime anomaly.

~8,700 train rows, ~4,300 test. Features include passenger demographics, cabin, group bookings, and spend at on-board services.

A standard impute-encode-train task on the surface. The interesting layer is what the IDs actually mean.

2

The schema reading

Two hidden structures:

PassengerId = "gggg_pp"

  • gggg = group / travel-party number.
  • pp = position within the group.
  • People with the same gggg are travelling together, often (but not always) as family.

Cabin = "Deck/Num/Side"

  • Deck = letter A–G, plus a rare T.
  • Num = cabin number.
  • Side = P (port) or S (starboard).

Both are compound fields packed into a single string. The model can't use the parts unless you split them out.
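A minimal sketch of the split, assuming pandas and the competition's column names PassengerId and Cabin. The toy rows are made up, and group_pos is an illustrative name that does not appear elsewhere in this sheet.

```python
import pandas as pd

# Toy rows in the competition's format; the values are made up for illustration.
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"],
    "Cabin": ["B/0/P", "F/1/S", None, "A/137/P"],
})

# PassengerId: "gggg_pp" -> group number and position within the group.
id_parts = df["PassengerId"].str.split("_", expand=True)
df["group_id"] = id_parts[0].astype(int)
df["group_pos"] = id_parts[1].astype(int)  # hypothetical column name

# Cabin: "Deck/Num/Side". A missing Cabin stays missing in all three parts.
cabin_parts = df["Cabin"].str.split("/", expand=True)
df["cabin_deck"] = cabin_parts[0]
df["cabin_num"] = pd.to_numeric(cabin_parts[1])
df["cabin_side"] = cabin_parts[2]

print(df[["group_id", "group_pos", "cabin_deck", "cabin_num", "cabin_side"]])
```

Keeping cabin_num numeric rather than as a string lets tree models split on it as an ordered value.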

3

Relational imputation

Once group_id is exposed:

  • Same group → usually the same Cabin / HomePlanet / Destination. If one passenger's value is known, use it to fill the rest of the group.
  • Same group → similar CryoSleep status. Family decisions tend to align.
  • Spend columns when CryoSleep = True → 0. People in cryo can't shop.

Replacing mean / median imputation with structural rules filled most of the missing values without inventing data.
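The group-fill and cryo rules could be sketched like this, assuming pandas and a group_id column already split out of the ID. fill_from_group is a hypothetical helper, and taking the most common known value in the group is one reasonable tie-break, not necessarily the one used in the project.

```python
import pandas as pd

# Toy data; values are made up for illustration.
df = pd.DataFrame({
    "group_id":   [1, 2, 2, 3, 3],
    "HomePlanet": ["Europa", "Earth", None, None, None],
    "CryoSleep":  [False, True, True, False, False],
    "Spa":        [120.0, None, 0.0, None, 45.0],
})

def fill_from_group(frame, col, key="group_id"):
    """Fill missing `col` values with the most common known value in the same group."""
    known = frame.dropna(subset=[col])
    # One value per group that has at least one known entry.
    group_value = known.groupby(key)[col].agg(lambda s: s.mode().iloc[0])
    # Groups with no known value map to NaN, so those rows stay missing.
    return frame[col].fillna(frame[key].map(group_value))

df["HomePlanet"] = fill_from_group(df, "HomePlanet")

# Passengers in cryosleep cannot spend, so a missing spend is 0 for them.
in_cryo = df["CryoSleep"].eq(True)
df.loc[in_cryo, "Spa"] = df.loc[in_cryo, "Spa"].fillna(0.0)

print(df)
```

Group 3 has no known HomePlanet, so the rule leaves it missing rather than guessing; those leftovers are where a statistical fallback would come in.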

4

Feature engineering moves

  • group_size — count of passengers per group_id.
  • is_solo — binary, group_size == 1.
  • cabin_deck, cabin_num, cabin_side — split out from raw Cabin.
  • total_spend — sum of RoomService + FoodCourt + ShoppingMall + Spa + VRDeck.
  • spent_anything — binary total_spend > 0.

Each one came from a hypothesis: what would make a group behave the same way?
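A sketch of the group and spend features above, assuming pandas, a group_id column, and the five spend columns named in the list. The toy values are made up.

```python
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy data; values are made up for illustration.
df = pd.DataFrame({
    "group_id":     [1, 2, 2, 3],
    "RoomService":  [0.0, 10.0, 0.0, 0.0],
    "FoodCourt":    [0.0, 5.0, 0.0, 200.0],
    "ShoppingMall": [0.0, 0.0, 0.0, 0.0],
    "Spa":          [0.0, 0.0, 0.0, 50.0],
    "VRDeck":       [0.0, 1.0, 0.0, None],
})

# Number of passengers sharing each group_id, broadcast back to every row.
df["group_size"] = df.groupby("group_id")["group_id"].transform("size")
df["is_solo"] = (df["group_size"] == 1).astype(int)

# Row-wise sum; pandas skips NaN by default, so a missing column counts as 0 here.
df["total_spend"] = df[SPEND_COLS].sum(axis=1)
df["spent_anything"] = (df["total_spend"] > 0).astype(int)

print(df[["group_id", "group_size", "is_solo", "total_spend", "spent_anything"]])
```

Because the sum treats a missing spend as zero, it is worth running the imputation rules first so that total_spend reflects filled values rather than gaps.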

5

The bake-off

Ten models, same pipeline, same CV scheme:

  • Logistic Regression
  • KNN
  • Naïve Bayes
  • Decision Tree
  • Random Forest
  • Gradient Boosting (sklearn)
  • XGBoost
  • LightGBM
  • CatBoost
  • Stacked ensemble

Stratified 5-fold CV, accuracy + F1 reported.

Winner: the gradient-boosting variants came out on top, with the stacked ensemble slightly ahead of each individual model. The margin was a few tenths of a percent.
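The shape of the bake-off loop, sketched with scikit-learn on synthetic data. Only three of the ten models are shown, the data comes from make_classification rather than the competition files, and the hyperparameters are defaults chosen for illustration, so the scores say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the engineered feature matrix (an assumption, not the real data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "grad_boost": GradientBoostingClassifier(random_state=0),
}

# One shared splitter so every model is scored on identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])
    results[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "f1": scores["test_f1"].mean(),
    }

# Rank by mean accuracy, best first.
for name, r in sorted(results.items(), key=lambda kv: kv[1]["accuracy"], reverse=True):
    print(f"{name:14s} acc={r['accuracy']:.3f} f1={r['f1']:.3f}")
```

Fixing random_state on the splitter is what makes the comparison fair: differences of a few tenths of a percent are easy to produce from fold assignment alone.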

6

What I learned

  • Read the data schema first. Anything stored as a compound string (A/137/P, 0007_02) is begging to be split.
  • Structural imputation > statistical imputation. Use relationships before you reach for the mean.
  • Bake-offs are useful, but the gap between #1 and #5 is usually smaller than the gap between bad and good features.
  • Stratified CV matters when classes are imbalanced — even slightly.