Lunar Lander PPO — Cheat Sheet

Training a PPO agent to land between two flags. Reward shaping for accuracy, stability over speed, and what the actor-critic split actually does.

Updated January 2026
1

The environment

OpenAI Gym LunarLander-v2:

  • State: 8 values (x and y position, x and y velocity, angle, angular velocity, and two leg-contact flags).
  • Action: 4 discrete choices (do nothing, fire left engine, fire main engine, fire right engine).
  • Reward: approaching the landing pad (+), using fuel (−), crashing (−100), landing safely between the flags (+100), each leg touching (+10).

The goal: maximise cumulative reward, which in practice means landing between the flags without wasting fuel.
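To make the observation and action layout concrete, here is a small sketch that gives the raw values names. The index order follows the Lunar Lander documentation as I understand it; the names `Action`, `LanderState`, and `decode` are hypothetical helpers for illustration, not part of Gym.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class Action(IntEnum):
    """Discrete action indices (assumed order: noop, left, main, right)."""
    NOOP = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


class LanderState(NamedTuple):
    """Named view of the 8-element observation vector."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def decode(obs: Sequence[float]) -> LanderState:
    """Turn the raw 8-element observation into named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, ang_vel, left, right = obs
    # The leg flags arrive as 0.0 or 1.0, so threshold them into booleans.
    return LanderState(x, y, vx, vy, angle, ang_vel, left > 0.5, right > 0.5)


state = decode([0.1, 1.4, 0.0, -0.5, 0.02, 0.0, 0.0, 1.0])
print(state.right_leg_contact, int(Action.FIRE_MAIN))
```

Naming the fields up front makes any later reward-shaping code much harder to get wrong by index.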

2

Why PPO

Proximal Policy Optimisation (PPO) is a common default choice for problems like this because:

  • On-policy: learns from fresh rollouts, which is often more stable in practice than off-policy DQN.
  • Clipped surrogate objective: prevents the policy from jumping too far in a single update.
  • Actor-Critic architecture:
    • Actor: the policy network. Decides actions.
    • Critic: the value network. Estimates state value, which serves as a baseline.
  • Handles discrete or continuous action spaces, so it works on Lunar Lander out of the box.
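The clipping and the critic's baseline role can be sketched for a single sample. This is a simplified illustration, not the library's code: the advantage here is plain return minus value estimate (Stable-Baselines3 uses generalised advantage estimation instead), and `advantage` and `clipped_surrogate` are hypothetical names.

```python
def advantage(observed_return: float, value_estimate: float) -> float:
    """Critic as baseline: how much better the outcome was than predicted."""
    return observed_return - value_estimate


def clipped_surrogate(ratio: float, adv: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO objective (to be maximised).

    ratio is pi_new(a|s) / pi_old(a|s). Taking the minimum of the clipped
    and unclipped terms removes any incentive to push the ratio outside
    [1 - clip_range, 1 + clip_range] in the direction that raises the objective.
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * adv, clipped_ratio * adv)


adv = advantage(observed_return=12.0, value_estimate=10.0)
print(clipped_surrogate(ratio=1.5, adv=adv))
```

With a positive advantage, a ratio of 1.5 earns no more than a ratio of 1.2 would, so the update has nothing to gain from moving further.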

Implementation: Stable-Baselines3.

3

Reward shaping

The default reward produces landings that merely survive. To get precise landings:

  • Penalise distance from the target zone (position accuracy).
  • Penalise angular velocity at touchdown (no wobble).
  • Penalise vertical speed at touchdown (soft landing).
  • Bonus for being well-centred when both legs touch.

Each shaping term shifts the policy's gradient toward "park nicely", not "survive".
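One way the four terms could be combined is sketched below, as a single extra reward intended to be added once on the touchdown step. The weights, the tolerance, and the function name `touchdown_shaping` are all assumptions for illustration, not the values used in the project. The sketch relies on the landing pad being centred at x = 0.

```python
def touchdown_shaping(
    x: float,
    vy: float,
    angular_velocity: float,
    left_contact: bool,
    right_contact: bool,
    w_dist: float = 1.0,        # hypothetical weight on distance from pad centre
    w_spin: float = 0.5,        # hypothetical weight on wobble
    w_speed: float = 0.5,       # hypothetical weight on vertical speed
    centre_bonus: float = 10.0, # hypothetical bonus for a centred two-leg landing
    centre_tol: float = 0.1,    # hypothetical half-width of the "well-centred" zone
) -> float:
    """Extra reward to add on the step where the lander touches down."""
    if not (left_contact or right_contact):
        return 0.0  # still in the air: no touchdown term
    penalty = (
        w_dist * abs(x)
        + w_spin * abs(angular_velocity)
        + w_speed * abs(vy)
    )
    well_centred = left_contact and right_contact and abs(x) <= centre_tol
    bonus = centre_bonus if well_centred else 0.0
    return bonus - penalty


print(touchdown_shaping(0.0, 0.0, 0.0, True, True))
```

Keeping each weight as a named parameter makes it easy to ablate one term at a time and see which one actually changes the landing.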

4

Hyperparameters

Worth tuning:

Parameter            Sane default
learning_rate        3e-4
gamma (discount)     0.99
n_steps per rollout  2048
batch_size           64
n_epochs per update  10
clip_range           0.2
ent_coef (entropy)   0.0–0.01

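Bigger n_steps means each update averages over more fresh experience, so the gradient is smoother, but the policy is updated less often, so learning is slower in wall-clock time.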
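The defaults above map directly onto keyword arguments of Stable-Baselines3's `PPO`. The sketch below also works out how many minibatch gradient steps one rollout produces. `PPO_KWARGS`, `gradient_steps_per_rollout`, and `make_model` are hypothetical names, and the import is deferred so the arithmetic runs even without the library installed.

```python
# Values from the table; ent_coef is set to the low end of its range.
PPO_KWARGS = {
    "learning_rate": 3e-4,
    "gamma": 0.99,
    "n_steps": 2048,
    "batch_size": 64,
    "n_epochs": 10,
    "clip_range": 0.2,
    "ent_coef": 0.0,
}


def gradient_steps_per_rollout(
    n_steps: int, batch_size: int, n_epochs: int, n_envs: int = 1
) -> int:
    """Minibatch updates done after each rollout is collected."""
    minibatches_per_epoch = (n_steps * n_envs) // batch_size
    return minibatches_per_epoch * n_epochs


def make_model(env):
    """Build a PPO model with the table's settings (assumes SB3 is installed)."""
    from stable_baselines3 import PPO  # deferred import: optional dependency

    return PPO("MlpPolicy", env, **PPO_KWARGS)


print(gradient_steps_per_rollout(
    PPO_KWARGS["n_steps"], PPO_KWARGS["batch_size"], PPO_KWARGS["n_epochs"]
))
```

With these settings each rollout of 2048 steps is split into 32 minibatches and reused for 10 epochs, giving 320 gradient steps before new data is collected.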

5

Training pattern

  1. Wrap env in Monitor to log episode returns.
  2. Train PPO for 1M timesteps as a baseline.
  3. Plot rolling mean episode return — a clean upward curve means learning, a flat one means stuck.
  4. Eval on 20+ deterministic episodes at the end. Single episodes lie.
  5. Compare shaped vs default reward — same algorithm, different policy.
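Steps 1 to 4 can be sketched roughly as follows, assuming Stable-Baselines3 with Gymnasium and Box2D installed. The function names are hypothetical, and the environment id is taken from this sheet; newer Gymnasium releases may register it as LunarLander-v3 instead. Imports are deferred so the plotting helper can be used on its own.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Mean over each full sliding window, for smoothing episode returns."""
    if window <= 0:
        raise ValueError("window must be positive")
    out: List[float] = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            out.append(total / window)
    return out


def train_and_evaluate(
    env_id: str = "LunarLander-v2",
    total_timesteps: int = 1_000_000,
    n_eval_episodes: int = 20,
):
    """Train a baseline PPO agent, then evaluate it deterministically."""
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy
    from stable_baselines3.common.monitor import Monitor

    train_env = Monitor(gym.make(env_id))  # step 1: log episode returns
    model = PPO("MlpPolicy", train_env, verbose=0)
    model.learn(total_timesteps=total_timesteps)  # step 2: baseline run

    # Step 4: a separate env keeps eval episodes out of the training log.
    eval_env = Monitor(gym.make(env_id))
    mean_ret, std_ret = evaluate_policy(
        model, eval_env, n_eval_episodes=n_eval_episodes, deterministic=True
    )
    return train_env.get_episode_rewards(), mean_ret, std_ret


print(rolling_mean([1.0, 2.0, 3.0, 4.0], window=2))
```

The returned training returns can be passed through `rolling_mean` (with a window of around 100 episodes, say) before plotting for step 3. For step 5, the same function would be run once per reward variant.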

Bug-prone areas: off-by-one errors in reward shaping (the shaping term should be a delta between consecutive steps, not a level), missing done flags, and normalisation drift between train and eval.
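The "delta, not a level" point can be illustrated with potential-based shaping, which is my framing here rather than something the project states it uses. The shaping term is the change in a potential between consecutive states, so hovering in place earns nothing extra. The potential (negative distance to the pad at the origin) and the function names are assumptions.

```python
import math
from typing import Tuple


def potential(x: float, y: float) -> float:
    """Higher is better: negative distance to the pad at (0, 0)."""
    return -math.hypot(x, y)


def delta_shaped_reward(
    base_reward: float,
    prev_xy: Tuple[float, float],
    next_xy: Tuple[float, float],
    gamma: float = 0.99,
) -> float:
    """Add the discounted change in potential, not the potential itself."""
    return base_reward + gamma * potential(*next_xy) - potential(*prev_xy)


def level_shaped_reward(base_reward: float, next_xy: Tuple[float, float]) -> float:
    """The bug-prone variant: pays the raw level on every step."""
    return base_reward + potential(*next_xy)


# Moving closer is rewarded; the level variant just punishes being far away.
print(delta_shaped_reward(0.0, (1.0, 0.0), (0.5, 0.0), gamma=1.0))
print(level_shaped_reward(0.0, (0.5, 0.0)))
```

The level variant accumulates a penalty proportional to episode length, which can teach the agent to end episodes quickly rather than to land well.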

6

What I learned

  • The reward function is the model. PPO just optimises whatever you wrote.
  • Smooth landings ≠ accurate landings. They need different reward terms.
  • Eval over many episodes. RL returns are wildly noisy; one good episode doesn't mean a good policy.
  • PPO is slow to win and slow to lose. If your return curve is flat, the reward is probably wrong, not the algorithm.