| Step | Meaning |
|---|---|
| Select data | Choose and collect data that represents the future cases you want to generalise to. |
| Preprocess | Clean errors, nulls, outliers, formats, irrelevant info. |
| Transform | Scale, encode, bin, log-transform, build features. |
| Model | Train and validate. Use cross-validation (CV) for tuning; touch the test set only at the end. |
ML Process
Before you touch data: Is ML even the right tool? If a deterministic rule works, ML may be unnecessary.
Understanding the data
- Feature meaning first. `-1` may be an impossible age but a valid code elsewhere.
- Representative data. Small samples → sampling noise. Flawed sampling → sampling bias.
- Biases (biased data → biased model):
  - Volunteer: participants differ from non-participants.
  - Selection: sample drawn from a narrow subgroup.
  - Survivorship: looking only at what passed the filter (the WWII planes).
- Supervised = target given (regression / classification). Unsupervised = no target labels.
Scaling — when it matters
- Mandatory: KNN · SVM · PCA · gradient descent · regularised linear/logistic
- Recommended: Neural nets · Linear/Logistic regression
- Optional: Decision trees · Random Forest · Gradient Boosted Trees

| Scaler | Use when |
|---|---|
| StandardScaler | Default. Mean 0, std 1. No strong outliers. |
| RobustScaler | Outliers present. Uses median + IQR. |
| MinMaxScaler | Need bounded range [0, 1] or non-negative. |
| Normalizer | Each row is a vector; angle/cosine matters (text, recommenders). |
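
A minimal sketch of why the table points outlier-heavy features at RobustScaler. The toy feature values are made up for illustration; only the two scikit-learn scalers come from the table above.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical toy feature: eight ordinary values plus one extreme outlier.
X = np.array([[20.0], [22.0], [24.0], [26.0], [28.0], [30.0], [32.0], [34.0], [500.0]])

standard = StandardScaler().fit_transform(X)  # centres on mean, divides by std
robust = RobustScaler().fit_transform(X)      # centres on median, divides by IQR

# How much room the eight ordinary points occupy after scaling.
standard_spread = standard[:8].max() - standard[:8].min()
robust_spread = robust[:8].max() - robust[:8].min()

print(f"StandardScaler spread of ordinary points: {standard_spread:.3f}")
print(f"RobustScaler spread of ordinary points:   {robust_spread:.3f}")
```

The outlier inflates the standard deviation, so StandardScaler squeezes the ordinary points into a narrow band. The median and IQR barely move, so RobustScaler keeps them spread out.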
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                    # learn mean/std from the training split only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)    # NEVER fit on test
```
Categorical features
| Encoding | Use when | Main trap |
|---|---|---|
| Ordinal | Real order: low < medium < high. | Fake distance/order on nominal categories. |
| One-hot | Nominal, manageable cardinality. | More columns + collinearity. |
| Target | High cardinality (zip code). | Uses y → leakage if outside CV. |
- Zip code / product ID = category, even when stored as a number.
- Binary `Yes/No` → one dummy column is enough.
- Use `OneHotEncoder` (not `pd.get_dummies`) inside pipelines: it remembers the categories seen during `fit`, so train and test always get the same columns.
- Tree models tolerate correlated dummies better than linear models.
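
A short sketch of the `OneHotEncoder` point above, assuming a made-up borough column in which one category shows up only at prediction time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical borough column; "Bronx" never appears in the training split.
train = np.array([["Queens"], ["Brooklyn"], ["Queens"], ["Manhattan"]])
test = np.array([["Brooklyn"], ["Bronx"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # categories are learned from the training split only

# The output is sparse by default; densify it here only to inspect it.
encoded_test = encoder.transform(test).toarray()
learned = list(encoder.categories_[0])

print(learned)
print(encoded_test)
```

The unseen category becomes an all-zero row instead of raising an error, and the column layout stays the one learned during `fit`.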
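```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_cols / cat_cols: lists of column names in your DataFrame
preprocess = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(handle_unknown="ignore"), cat_cols),
)
```
Outliers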
Ask first, then act:
| Question | Action |
|---|---|
| Is it an error? | Correct, remove, or impute. |
| Valid and meaningful? | Keep. Use robust scaler / log / add indicator. |
| Target of interest (fraud, failure)? | Do not remove — model needs it. |
| Tiny and non-representative? | Remove if removal will not bias the sample. |
Detection:
- Z-score: `|z| > 3`. Assumes ~Gaussian.
- IQR: outside `[Q1 − 1.5·IQR, Q3 + 1.5·IQR]`. No distribution assumption.
- Model-based: `IsolationForest`, `OneClassSVM`, `LocalOutlierFactor`, robust covariance. Best for high-dim / non-Gaussian.
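
The first two detection rules can be sketched with the standard library alone. The helper names and the sample ages are hypothetical; the thresholds (3 and 1.5) are the ones listed above.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages: twenty plausible values and one data-entry error.
ages = list(range(20, 40)) + [150]

print(zscore_outliers(ages))
print(iqr_outliers(ages))
```

Both rules flag the same point here. On small or heavily skewed samples the z-score rule can miss outliers, because the outlier itself inflates the standard deviation.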
Null values
| Type | Meaning | Action |
|---|---|---|
| MCAR | Missingness is completely random. | Remove rows if few; impute if many. |
| MAR | Missingness depends on observed variables. | Impute with a multivariate imputer (KNN/Iterative). |
| MNAR | Missingness depends on the unobserved value, so it is itself a signal. | Add MissingIndicator + impute. |
| Situation | Best start |
|---|---|
| Numerical missing | Median (robust to outliers). |
| Categorical missing | Most frequent or constant "Missing". |
| Missingness meaningful | add_indicator=True. |
| Need relationships | KNNImputer / IterativeImputer. |
| > 50 % missing, no meaning | Drop column. |
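
A small sketch of the "median + indicator" rows of the table, using a made-up income column. `add_indicator=True` appends one 0/1 column for each feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical income column with two missing entries.
income = np.array([[30.0], [np.nan], [50.0], [40.0], [np.nan]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
result = imputer.fit_transform(income)

imputed_values = result[:, 0]  # NaNs replaced by the median of the observed values
was_missing = result[:, 1]     # 1.0 where the original value was missing

print(imputed_values)
print(was_missing)
```

Keeping the indicator lets the model use "was missing" as a feature in its own right, which matters when the missingness is informative.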
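```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric: fill gaps with the median, then scale
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical: fill gaps with the most frequent level, then one-hot encode
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
```
Other preprocessing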
- Skewness: Log / Box-Cox on long right tails (prices, income, time).
- Binning: Group levels to reduce cardinality.
- Discretisation: Continuous → categories via `KBinsDiscretizer`.
- Typing: Dates as `datetime64`, IDs as category, currency strings → numeric.

`0` ≠ missing. Always check the domain meaning before treating zeros as missing values.
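
A sketch of the skewness and discretisation items, assuming a made-up price column with a long right tail. The `skewness` helper is illustrative and is not a library function.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer


def skewness(values):
    """Population skewness: mean cubed deviation divided by std cubed."""
    centred = values - values.mean()
    return float((centred ** 3).mean() / values.std() ** 3)


# Hypothetical prices with one very expensive item.
prices = np.array([[10.0], [12.0], [15.0], [20.0], [30.0], [50.0], [100.0], [1000.0]])

# log1p = log(1 + x); it also handles zeros safely.
log_prices = np.log1p(prices)

# Quantile bins put roughly the same number of rows in each category.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
price_bins = binner.fit_transform(prices)

print(f"skew before: {skewness(prices):.2f}, after log1p: {skewness(log_prices):.2f}")
print(price_bins.ravel())
```

Inside a real workflow the binner would be fitted on the training split only, like any other transformer.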
Pipeline template
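```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# X (a DataFrame) and y are assumed to be loaded already
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

numeric_features = ["age", "salary"]
categorical_features = ["boro", "zipcode"]

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocess = make_column_transformer(
    (num_pipe, numeric_features),
    (cat_pipe, categorical_features),
)
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Cross-validate the WHOLE pipeline, on the training split only
scores = cross_val_score(pipe, X_train, y_train, cv=5)

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)  # test only at the end
```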
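
To illustrate "CV for tuning, test only at the end", here is a hedged sketch that tunes one hyperparameter of a pipeline with `GridSearchCV`. The synthetic data and the grid of `C` values are assumptions chosen only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not from the notes above).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# The scaler is refitted inside every CV fold, so no fold sees validation data.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)  # single final check on held-out data
print(search.best_params_, round(test_accuracy, 3))
```

Because the search wraps the whole pipeline, preprocessing is tuned and validated together with the model, and the test split is used exactly once.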