| Parameter | Sane default |
|---|---|
| learning_rate | 3e-4 |
| gamma (discount) | 0.99 |
| n_steps per rollout | 2048 |
| batch_size | 64 |
| n_epochs per update | 10 |
| clip_range | 0.2 |
| ent_coef (entropy) | 0.0–0.01 |
The environment
OpenAI Gym LunarLander-v2:
- State: 8 values (six continuous, two boolean): position (x, y), velocity (vx, vy), angle, angular velocity, and one ground-contact flag per leg.
- Action: 4 discrete choices (indices 0 to 3): do nothing, fire left engine, fire main engine, fire right engine.
- Reward: approach landing pad (+), use fuel (−), crash (−100), land safely between flags (+100), legs touching (+10 each).
The goal: maximise cumulative reward → land between the flags without exhausting fuel.
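To make the observation layout concrete, here is a minimal sketch that unpacks the 8-value state into named fields. `LanderState` and `decode` are illustrative names, not part of Gym, and the field order assumes the layout documented for LunarLander (x, y, vx, vy, angle, angular velocity, left contact, right contact).

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Named view of a LunarLander observation (hypothetical helper)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def decode(obs: Sequence[float]) -> LanderState:
    """Unpack a raw 8-element observation; contact flags arrive as 0.0 or 1.0."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, ang_vel, left, right = (float(v) for v in obs)
    return LanderState(x, y, vx, vy, angle, ang_vel, left > 0.5, right > 0.5)


# Example: descending at vy = -0.5 with only the right leg touching.
state = decode([0.1, 1.4, 0.0, -0.5, 0.02, 0.0, 0.0, 1.0])
print(state.vy, state.left_leg_contact, state.right_leg_contact)
```

Naming the fields once makes later reward-shaping code easier to read than indexing into the raw array.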
Why PPO
Proximal Policy Optimisation (PPO) is a common default among modern RL algorithms because:
- On-policy: it learns from fresh rollouts, which tends to be more stable than off-policy DQN.
- Clipped surrogate objective — prevents the policy from jumping too far each update.
- Actor-Critic architecture:
  - Actor: the policy network. Decides actions.
  - Critic: the value network. Estimates state value as a baseline.
- Discrete or continuous action spaces. Works on Lunar Lander out of the box.
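The clipped surrogate objective mentioned above can be sketched for a single sample as follows. This is a plain-Python illustration of the standard formula, not Stable-Baselines3's implementation. `ratio` stands for the new-to-old action probability ratio and `advantage` for the critic-based advantage estimate.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): the gain from raising its probability is capped at 1 + eps.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Bad action (A < 0): pushing its probability below 1 - eps earns nothing extra.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
# Bad action made more likely: no clipping relief, the full penalty applies.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

Because the objective stops improving once the ratio leaves the clip band in the favourable direction, there is no incentive to move the policy far from the one that collected the data.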
Implementation: Stable-Baselines3.
Reward shaping
The default reward leads to surviving landings. To get precise landings:
- Penalise distance from the target zone (positional accuracy).
- Penalise angular velocity at touchdown (no wobble).
- Penalise vertical speed at touchdown (soft landing).
- Bonus for being well-centred when both legs touch.
Each shaping term shifts the policy's gradient toward "park nicely", not "survive".
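The four shaping terms could be combined roughly as in the sketch below. All weights and the `shaped_reward` helper are hypothetical starting points rather than tuned values, and the code assumes the landing pad sits at the origin (0, 0) of the observation coordinates, with the observation laid out as (x, y, vx, vy, angle, angular velocity, left contact, right contact).

```python
import math
from typing import Sequence

# Hypothetical weights; tune them for your own setup.
W_DIST = 1.0         # progress toward the pad
W_SPIN = 5.0         # angular velocity at touchdown
W_IMPACT = 5.0       # vertical speed at touchdown
CENTRE_BONUS = 10.0  # awarded for a well-centred touchdown
CENTRE_TOL = 0.1     # |x| below this counts as well-centred


def potential(obs: Sequence[float]) -> float:
    """Negative distance to the pad, assumed to sit at the origin."""
    return -math.hypot(obs[0], obs[1])


def both_legs_down(obs: Sequence[float]) -> bool:
    return obs[6] > 0.5 and obs[7] > 0.5


def shaped_reward(base_reward: float, prev_obs: Sequence[float],
                  obs: Sequence[float]) -> float:
    """Add shaping terms to the environment's own reward for one step."""
    # Distance term as a delta: reward progress, not the current level.
    reward = base_reward + W_DIST * (potential(obs) - potential(prev_obs))

    # Touchdown terms fire once, on the step where both legs first make contact.
    if both_legs_down(obs) and not both_legs_down(prev_obs):
        reward -= W_SPIN * abs(obs[5])    # no wobble
        reward -= W_IMPACT * abs(obs[3])  # soft landing
        if abs(obs[0]) < CENTRE_TOL:
            reward += CENTRE_BONUS        # well-centred
    return reward
```

In practice this function would live inside an environment wrapper that remembers the previous observation between steps.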
Hyperparameters
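Worth tuning: the parameters listed in the table at the top.
A bigger n_steps gives a smoother gradient estimate, but the policy is refreshed less often, so wall-clock learning is slower.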
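To see what the n_steps trade-off means in numbers, the sketch below counts rollouts and gradient steps for a given configuration. The dictionary keys follow the argument names of Stable-Baselines3's `PPO`; `update_budget` is a hypothetical helper, and the single-environment default (`n_envs=1`) is an assumption.

```python
import math

# Defaults from the table, keyed by Stable-Baselines3 PPO argument names.
PPO_KWARGS = {
    "learning_rate": 3e-4,
    "gamma": 0.99,
    "n_steps": 2048,
    "batch_size": 64,
    "n_epochs": 10,
    "clip_range": 0.2,
    "ent_coef": 0.0,  # the table suggests anything from 0.0 to 0.01
}


def update_budget(total_timesteps: int, n_steps: int, batch_size: int,
                  n_epochs: int, n_envs: int = 1) -> dict:
    """Count rollouts and gradient steps implied by a PPO configuration."""
    rollout_size = n_steps * n_envs
    rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches_per_epoch = math.ceil(rollout_size / batch_size)
    return {
        "rollouts": rollouts,
        "grad_steps_per_rollout": minibatches_per_epoch * n_epochs,
        "total_grad_steps": rollouts * minibatches_per_epoch * n_epochs,
    }


budget = update_budget(
    1_000_000,
    PPO_KWARGS["n_steps"],
    PPO_KWARGS["batch_size"],
    PPO_KWARGS["n_epochs"],
)
print(budget)
```

With 1M timesteps, n_steps=2048 gives 489 rollouts, while n_steps=8192 gives only 123: the total number of gradient steps stays roughly the same, but data is re-collected with an improved policy four times less often. The same dictionary can be passed to the constructor as `PPO("MlpPolicy", env, **PPO_KWARGS)`.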
Training pattern
- Wrap env in Monitor to log episode returns.
- Train PPO for 1M timesteps as a baseline.
- Plot rolling mean episode return — a clean upward curve means learning, a flat one means stuck.
- Eval on 20+ deterministic episodes at the end. Single episodes lie.
- Compare shaped vs default reward — same algorithm, different policy.
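The steps above might look like the following with Stable-Baselines3 and Gymnasium. Treat it as a sketch under assumptions: the `train_and_evaluate` and `rolling_mean` helpers are hypothetical, the environment id is taken from the text (newer Gymnasium releases register the environment as LunarLander-v3 instead), and PPO runs with the library's default hyperparameters.

```python
from collections import deque
from typing import Iterable, List


def rolling_mean(values: Iterable[float], window: int = 100) -> List[float]:
    """Mean of the last `window` values at each position (shorter at the start)."""
    recent: deque = deque(maxlen=window)
    out = []
    for v in values:
        recent.append(v)
        out.append(sum(recent) / len(recent))
    return out


def train_and_evaluate(env_id: str = "LunarLander-v2",
                       total_timesteps: int = 1_000_000,
                       n_eval_episodes: int = 20):
    """Train PPO on a Monitor-wrapped env, then evaluate deterministically."""
    # Imported here so rolling_mean stays usable without the RL stack installed.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy
    from stable_baselines3.common.monitor import Monitor

    train_env = Monitor(gym.make(env_id))  # records per-episode returns
    model = PPO("MlpPolicy", train_env, verbose=0)
    model.learn(total_timesteps=total_timesteps)

    eval_env = Monitor(gym.make(env_id))
    mean_return, std_return = evaluate_policy(
        model, eval_env, n_eval_episodes=n_eval_episodes, deterministic=True
    )
    # Smoothed training curve, ready to plot.
    curve = rolling_mean(train_env.get_episode_rewards())
    return model, curve, mean_return, std_return
```

Calling `train_and_evaluate()` once with the default reward and once with a shaped-reward wrapper around the environment gives the two policies to compare.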
Bug-prone areas: off-by-one in reward shaping (reward should be a delta, not a level), missing done flags, normalisation drift between train and eval.
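The missing-done-flag bug is easiest to see in a return computation. The sketch below is illustrative only: Stable-Baselines3 computes its own advantages internally, `discounted_returns` is a hypothetical helper, and value bootstrapping at the end of a truncated rollout is left out for brevity.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], dones: Sequence[bool],
                       gamma: float = 0.99) -> List[float]:
    """Discounted return per step; dones[t] marks the last step of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # do not let the next episode's rewards leak backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0, 1.0]
# Two episodes of two steps each, stored back to back in one buffer.
print(discounted_returns(rewards, [False, True, False, False], gamma=0.5))
# Same buffer with the done flag dropped: the first episode is over-credited.
print(discounted_returns(rewards, [False, False, False, False], gamma=0.5))
```

With the flag present, the first episode's returns stop at its own boundary. Without it, rewards from the second episode inflate the value targets of the first.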
What I learned
- The reward function is the model. PPO just optimises whatever you wrote.
- Smooth landings ≠ accurate landings. They need different reward terms.
- Eval over many episodes. RL returns are wildly noisy; one good episode doesn't mean a good policy.
- PPO is slow to win and slow to lose. If your return curve is flat, the reward is probably wrong, not the algorithm.
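One way to act on the "eval over many episodes" point is to report a mean with its standard error instead of a single return. A minimal sketch follows; `summarise_returns` is a hypothetical helper and the sample returns are made-up numbers for illustration, not results from this project.

```python
import math
import statistics
from typing import Sequence


def summarise_returns(returns: Sequence[float]) -> dict:
    """Mean, sample standard deviation and standard error of episode returns."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two episodes to estimate spread")
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)
    return {"n": n, "mean": mean, "std": std, "stderr": std / math.sqrt(n)}


# Made-up returns from five evaluation episodes.
summary = summarise_returns([200.0, 220.0, 180.0, 240.0, 160.0])
print(summary)
```

If the gap between two policies is smaller than a couple of standard errors, more evaluation episodes are needed before calling one of them better.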