Maria Aguilera

OpenAI Gym LunarLander-v2:

State: 8 floats, namely position (x, y), velocity (vx, vy), angle, angular velocity, and two leg-contact flags (0.0 or 1.0).
Action: 4 discrete, indexed 0 to 3 in this order: do nothing, fire left engine, fire main engine, fire right engine.
Reward: approach landing pad (+), use fuel (−), crash (−100), land safely between flags (+100), legs touching (+10 each).

The goal: maximise cumulative reward, which means landing between the flags while burning as little fuel as possible (fuel is unlimited, but every engine firing costs reward).
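To make the state and action layout concrete, here is a minimal sketch that unpacks one observation into named fields. The names `LanderState`, `ACTIONS`, and `decode_observation` are made up for illustration; only the field order and the action indices come from the environment.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Named view of the 8-value LunarLander observation (illustrative helper)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


# Discrete action indices in the order the environment uses them.
ACTIONS = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}


def decode_observation(obs: Sequence[float]) -> LanderState:
    """Turn a raw observation into named fields; contact flags become booleans."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, ang_vel, left, right = (float(v) for v in obs)
    return LanderState(x, y, vx, vy, angle, ang_vel, left > 0.5, right > 0.5)
```

The landing pad sits at (0, 0) in these coordinates, so `abs(state.x)` is a direct measure of how far off-centre the lander is.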

Proximal Policy Optimisation (PPO) is a common default RL algorithm for this kind of task because:

On-policy: learns from fresh rollouts, which tends to be more stable in practice than off-policy DQN.
Clipped surrogate objective: prevents the policy from jumping too far in a single update.
Actor-Critic architecture:
- Actor: the policy network. Decides actions.
- Critic: the value network. Estimates state value as a baseline.
Discrete or continuous action spaces: works on Lunar Lander out of the box.
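The clipping idea can be sketched for a single sample as below. This is a scalar illustration, not how Stable-Baselines3 implements it (the library works on batched tensors); `clipped_surrogate` is a hypothetical name, `ratio` stands for new-policy probability divided by old-policy probability, and `advantage` for the critic-based advantage estimate.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    PPO maximises the mean of this value over a minibatch.
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum keeps the more pessimistic of the two estimates,
    # so there is no extra gain from pushing the ratio outside the clip band.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio of 1.5 is credited only as if it were 1.2. With a negative advantage, the same ratio is not clipped, so a move that makes a bad action more likely is still fully penalised.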

Implementation: Stable-Baselines3.

The default reward produces landings that merely survive. To get precise landings:

Penalise distance from the target zone (positional accuracy).
Penalise angular velocity at touchdown (no wobble).
Penalise vertical speed at touchdown (soft landing).
Bonus for being well-centred when both legs touch.

Each shaping term shifts the policy's gradient toward "park nicely", not "survive".
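One way the four shaping terms could be written is sketched below, as a pure function over two consecutive observations. Everything here is an assumption made for illustration: the name `shaping_term`, the `ShapingWeights` values (untuned placeholders), and the choice to pay the touchdown terms once, on the step where contact first happens, so that each term is a delta and not a per-step level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapingWeights:
    """Hypothetical, untuned weights for the extra reward terms."""
    distance: float = 1.0
    spin: float = 0.5
    vertical_speed: float = 0.5
    centred_bonus: float = 5.0
    centred_tolerance: float = 0.1  # max |x| that counts as well-centred


def shaping_term(prev_obs, obs, w: ShapingWeights = ShapingWeights()) -> float:
    """Extra reward for one transition, to be added to the environment reward."""
    x, _, _, vy, _, ang_vel, left, right = obs
    prev_any = prev_obs[6] > 0.5 or prev_obs[7] > 0.5
    prev_both = prev_obs[6] > 0.5 and prev_obs[7] > 0.5
    any_contact = left > 0.5 or right > 0.5
    both = left > 0.5 and right > 0.5

    # Distance term as a delta: positive when the lander moved closer to x = 0.
    extra = w.distance * (abs(prev_obs[0]) - abs(x))

    # Touchdown penalties, applied only on the step where contact begins.
    if any_contact and not prev_any:
        extra -= w.spin * abs(ang_vel)
        extra -= w.vertical_speed * abs(vy)

    # One-off bonus when both legs first touch while well-centred.
    if both and not prev_both and abs(x) < w.centred_tolerance:
        extra += w.centred_bonus
    return extra
```

In training, a function like this would typically be called from an environment wrapper's `step`, which has to remember the previous observation and reset it at the start of each episode.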

Worth tuning:

| Parameter | Sane default |
| --- | --- |
| `learning_rate` | `3e-4` |
| `gamma` (discount) | `0.99` |
| `n_steps` per rollout | `2048` |
| `batch_size` | `64` |
| `n_epochs` per update | `10` |
| `clip_range` | `0.2` |
| `ent_coef` (entropy) | `0.0`–`0.01` |

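Bigger `n_steps` = more data per update, so a smoother (lower-variance) gradient, but fewer policy updates for the same timestep budget, so slower wall-clock learning.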
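The trade-off is easy to check with a little arithmetic. The sketch below collects the defaults into a dict that could be passed to `PPO(...)` as keyword arguments, and counts updates for a given budget. `update_budget` is a hypothetical helper, and the rollout count is approximate because it assumes a single environment and that training stops at the end of the first rollout that reaches the budget.

```python
import math

# Defaults from the table; ent_coef is set to the low end of its range.
PPO_KWARGS = {
    "learning_rate": 3e-4,
    "gamma": 0.99,
    "n_steps": 2048,
    "batch_size": 64,
    "n_epochs": 10,
    "clip_range": 0.2,
    "ent_coef": 0.0,
}


def update_budget(total_timesteps: int, n_steps: int, batch_size: int,
                  n_epochs: int, n_envs: int = 1) -> dict:
    """Rough count of rollouts and gradient steps for a training run."""
    rollout_size = n_steps * n_envs
    minibatches_per_epoch = rollout_size // batch_size
    return {
        # Each rollout is followed by exactly one policy update phase.
        "rollouts": math.ceil(total_timesteps / rollout_size),
        "gradient_steps_per_rollout": n_epochs * minibatches_per_epoch,
    }


budget = update_budget(
    1_000_000,
    PPO_KWARGS["n_steps"],
    PPO_KWARGS["batch_size"],
    PPO_KWARGS["n_epochs"],
)
```

Doubling `n_steps` to 4096 roughly halves the number of rollouts while doubling the gradient steps taken after each one, so the policy is refreshed less often even though the total amount of optimisation is about the same.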

Wrap env in Monitor to log episode returns.
Train PPO for 1M timesteps as a baseline.
Plot rolling mean episode return — a clean upward curve means learning, a flat one means stuck.
Eval on 20+ deterministic episodes at the end. Single episodes lie.
Compare shaped vs default reward — same algorithm, different policy.
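The steps above might look like this in code. It is a sketch under several assumptions: Stable-Baselines3 and Gymnasium with the Box2D extras are installed, the environment id `LunarLander-v2` is available (newer Gymnasium releases may require `LunarLander-v3`), and PPO's library defaults are acceptable. `rolling_mean` and `train_and_evaluate` are names chosen for this example.

```python
def rolling_mean(values, window):
    """Sliding-window mean; yields len(values) - window + 1 points (or none)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(values[i - window:i]) / window
            for i in range(window, len(values) + 1)]


def train_and_evaluate(env_id="LunarLander-v2", total_timesteps=1_000_000):
    # Third-party imports are local so rolling_mean stays usable on its own.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy
    from stable_baselines3.common.monitor import Monitor

    # Monitor records the return of every training episode.
    train_env = Monitor(gym.make(env_id))
    model = PPO("MlpPolicy", train_env, verbose=0)
    model.learn(total_timesteps=total_timesteps)

    # Learning curve: rolling mean over the last 100 training episodes.
    curve = rolling_mean(train_env.get_episode_rewards(), window=100)

    # Final check on a separate env, 20 deterministic episodes.
    eval_env = Monitor(gym.make(env_id))
    mean_return, std_return = evaluate_policy(
        model, eval_env, n_eval_episodes=20, deterministic=True
    )
    return curve, mean_return, std_return
```

For the shaped-versus-default comparison, run the same function twice, once on the plain environment and once on a shaped one, and evaluate both policies against the same unshaped reward so the numbers are comparable.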

Bug-prone areas: off-by-one errors in reward shaping (each term should reward the change between steps, a delta, not the current level), missing done flags, normalisation drift between train and eval.

The reward function is the model. PPO just optimises whatever you wrote.
Smooth landings ≠ accurate landings. They need different reward terms.
Eval over many episodes. RL returns are wildly noisy; one good episode doesn't mean a good policy.
PPO is slow to win and slow to lose. If your return curve is flat, the reward is probably wrong, not the algorithm.
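A small helper makes the "many episodes" point measurable: report the spread and the standard error of the mean alongside the mean itself. This is a generic statistics sketch using only the standard library; `summarise_returns` is a hypothetical name and the input is assumed to be a list of per-episode returns from evaluation.

```python
import statistics


def summarise_returns(returns):
    """Return (mean, sample std, standard error of the mean) of episode returns."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two episodes to estimate spread")
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)
    # The standard error shrinks with sqrt(n), which is why 20+ episodes help.
    return mean, std, std / n ** 0.5
```

Two policies whose mean returns differ by less than a couple of standard errors are hard to tell apart from that evaluation alone.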

Lunar Lander PPO — Cheat Sheet

The environment

Why PPO

Reward shaping

Hyperparameters

Training pattern

What I learned