┌─────────┐          action a_t          ┌─────────────┐
│  AGENT  │ ───────────────────────────► │ ENVIRONMENT │
└─────────┘                              └─────────────┘
     ▲                                          │
     │      state s_{t+1}, reward r_{t+1}       │
     └──────────────────────────────────────────┘
The setup
At each step:
- Agent observes state s.
- Picks action a from a policy π(a | s).
- Environment returns next state s' and reward r.
- Goal: maximise the cumulative discounted reward:
G = Σ_{t=0}^{∞} γ^t · r_{t+1}, where γ ∈ [0, 1] is the discount factor.
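To make the loop and the return concrete, here is a minimal sketch in plain Python. `CorridorEnv`, `run_episode`, and `discounted_return` are hypothetical names invented for this illustration, not part of any library, and `rewards[t]` stands for the reward received after the t-th action.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


class CorridorEnv:
    """Hypothetical toy environment: start in cell 0, episode ends at cell `length`."""

    def __init__(self, length=4):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, any other action moves left (clamped at cell 0)
        self.state = max(0, self.state + (1 if action == 1 else -1))
        done = self.state == self.length
        reward = 1.0 if done else 0.0   # reward only at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Roll out one episode and return the list of rewards received."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)                  # agent picks an action
        state, reward, done = env.step(action)  # environment answers
        rewards.append(reward)
        if done:
            break
    return rewards


rewards = run_episode(CorridorEnv(length=4), policy=lambda s: 1)  # always move right
print(rewards)                          # [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, 0.9))  # approximately 0.729 (= 0.9 ** 3)
```

The single reward arrives three steps after the first action, so it is discounted by γ³. A smaller γ makes the agent care less about rewards that arrive late.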
Supervised vs RL
| | Supervised | Reinforcement |
|---|---|---|
| Signal | Correct label per sample | Scalar reward, often delayed |
| Data | Static dataset | Generated by interaction |
| Goal | Predict | Decide / control |
| Trade-off | Bias vs variance | Exploration vs exploitation |
| Failure mode | Overfit | Bad reward → bad policy |
In supervised learning, the dataset is given. In RL, the dataset is created by the policy itself, and bad policies generate bad data. That feedback loop is most of what makes RL hard.
Exploration vs exploitation
The defining tension of RL.
- Exploit: pick the action your current best estimate says is best.
- Explore: try other actions to learn more.
Pure exploit → stuck in local optimum. Pure explore → never converge.
Standard approaches:
- ε-greedy: pick the best action with probability 1−ε, a random action otherwise.
- Boltzmann (softmax) exploration: sample actions with probability proportional to exp(Q(s, a) / τ), where the temperature τ controls how random the choice is.
- Optimism in the face of uncertainty (UCB): explore actions with high upper-confidence bounds.
- Entropy regularisation in policy methods: penalise overly deterministic policies.
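The first three strategies can be sketched as action-selection functions over a list of value estimates. This is an illustrative sketch, not a reference implementation: the function names, the `temperature` parameter, and the exploration constant `c` are assumptions, and the UCB variant shown is the common UCB1-style bonus.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution: p(a) proportional to exp(Q(a) / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample an action from the Boltzmann distribution over q_values."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def ucb1(q_values, counts, t, c=2.0):
    """UCB1-style choice: estimate plus a bonus that shrinks as an action is tried more."""
    for a, n in enumerate(counts):
        if n == 0:  # try every action once before relying on the bonus
            return a
    scores = [q + math.sqrt(c * math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])


q = [1.0, 3.0, 2.0]
print(epsilon_greedy(q, epsilon=0.0))  # 1 (pure exploitation)
print(softmax_probs(q, temperature=1.0))
```

With `epsilon=0.0` the first function is pure exploitation. As `temperature` grows, the Boltzmann distribution approaches uniform, which is pure exploration.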
Value-based methods
Learn a value function that estimates how good each state (or state-action pair) is.
- V(s): expected return from state s.
- Q(s, a): expected return from taking action a in state s, then following the policy.
Policy is derived: π(s) = argmax_a Q(s, a).
Key algorithms:
- Q-Learning: off-policy, updates Q toward r + γ · max_a' Q(s', a').
- SARSA: on-policy, updates Q toward r + γ · Q(s', a'), where a' is the action actually taken next.
- DQN: deep Q-network. Replaces the Q-table with a neural net. Atari-era breakthrough.
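The difference between the Q-Learning and SARSA targets is easiest to see side by side in the tabular case. The sketch below assumes a dictionary-backed Q-table and a learning rate `alpha` (the step size toward the target, which the text above leaves implicit). The state and action names are made up for the example.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Off-policy: bootstrap from the best next action, whatever the agent does next."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy: bootstrap from the next action the agent actually takes."""
    next_value = 0.0 if done else Q[(s_next, a_next)]
    target = r + gamma * next_value
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Derived policy: pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])


Q = defaultdict(float)  # Q-table: unseen (state, action) pairs start at 0
actions = ["left", "right"]
Q[("s1", "left")], Q[("s1", "right")] = 1.0, 2.0

# Q-Learning uses max over next actions: target = 0 + 0.9 * 2.0
q_learning_update(Q, "s0", "right", r=0.0, s_next="s1", actions=actions, alpha=0.5, gamma=0.9)
# SARSA uses the action actually taken next ("left"): target = 0 + 0.9 * 1.0
sarsa_update(Q, "s0", "left", r=0.0, s_next="s1", a_next="left", alpha=0.5, gamma=0.9)

print(Q[("s0", "right")], Q[("s0", "left")])  # 0.9 0.45
print(greedy_action(Q, "s0", actions))        # right
```

DQN keeps the Q-Learning target but replaces the table lookup `Q[(s, a)]` with a neural network's output.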
Policy-based methods
Directly parameterise the policy π_θ(a | s) and optimise θ to maximise expected return.
- No value function needed (in pure form).
- Naturally handles continuous action spaces.
- Stochastic by default — great for partial observability.
Key algorithms:
- REINFORCE: vanilla policy gradient. High variance.
- A2C / A3C: actor-critic, adds a value baseline to reduce variance.
- PPO: clips the probability ratio between the new and old policy so that each update stays small. The modern default. Stable and easy to tune.
- SAC: off-policy with entropy regularisation. Sample-efficient.
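As a rough illustration of what PPO's clipping does, the sketch below computes the clipped surrogate objective for a small batch in plain Python. The function name is hypothetical and `clip_eps=0.2` is an assumed, commonly used value. A real implementation would compute this on tensors and differentiate it with respect to the policy parameters.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised) over a batch of samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The new policy makes a good action (advantage +1) 1.5x more likely than before:
# the gain is capped at 1 + clip_eps, so there is no incentive to move further.
print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))  # approximately 1.2
```

Because the objective stops improving once the ratio leaves the interval [1 − ε, 1 + ε] in the favourable direction, a single batch cannot push the policy arbitrarily far from the one that collected the data.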
Actor-Critic
The dominant modern architecture combines both:
- Actor — the policy network. Decides actions.
- Critic — the value network. Estimates how good the state is.
The critic's estimate is used as a baseline to reduce variance in the actor's gradient updates.
PPO, A2C, SAC — all actor-critic under the hood.
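A minimal sketch of the baseline idea: subtract the critic's value estimate from the observed discounted return to get an advantage, which then weights the actor's gradient. The function name and the numbers are hypothetical. The sketch uses full episode returns for simplicity, whereas practical actor-critic methods often bootstrap from the critic instead.

```python
def advantages_with_baseline(rewards, values, gamma=0.99):
    """Advantage A_t = G_t - V(s_t): discounted return minus the critic's estimate."""
    advantages = []
    g = 0.0
    # Walk the episode backwards so each return reuses the one after it
    for r, v in zip(reversed(rewards), reversed(values)):
        g = r + gamma * g
        advantages.append(g - v)
    advantages.reverse()
    return advantages


rewards = [1.0, 1.0, 1.0]        # rewards observed in one episode
critic_values = [2.5, 1.5, 1.0]  # hypothetical critic estimates V(s_t)
print(advantages_with_baseline(rewards, critic_values, gamma=1.0))  # [0.5, 0.5, 0.0]
```

A positive advantage means the episode went better than the critic expected from that state, so the actor makes the chosen action more likely. A value near zero leaves the policy almost unchanged, which is where the variance reduction comes from.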
Reward shaping
The reward function defines the problem. Designing it is most of the work.
- Sparse rewards (only at the goal) → learning often stalls. Long credit-assignment chain.
- Dense rewards (every step) → fast learning, but you can shape behaviour you didn't intend.
- Shaped rewards can produce reward hacking — agent finds a loophole.
Rules of thumb:
- Start sparse. Add shaping terms only when nothing learns.
- Test if the agent maximises your goal or your reward proxy. They're not always the same.
- See real examples: DeepRacer cheat sheet, Lunar Lander cheat sheet.
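The gap between goal and reward proxy can be shown on a made-up 1-D task where the agent should reach a goal cell. All three functions below are hypothetical illustrations, not taken from any of the environments mentioned here. The "naive" version pays for steps toward the goal without charging for steps away, which is exactly the kind of loophole an agent can farm.

```python
def sparse_reward(position, goal):
    """Reward only when the goal is reached."""
    return 1.0 if position == goal else 0.0


def naive_shaped_reward(prev_position, position, goal, bonus=0.1):
    """Flawed shaping: pays for steps toward the goal, never charges for steps away."""
    moved_closer = abs(goal - position) < abs(goal - prev_position)
    return sparse_reward(position, goal) + (bonus if moved_closer else 0.0)


def signed_shaped_reward(prev_position, position, goal, weight=0.1):
    """Shaping on signed progress: moving away costs what moving closer earned."""
    progress = abs(goal - prev_position) - abs(goal - position)
    return sparse_reward(position, goal) + weight * progress


# An agent that oscillates between cells 0 and 1 and never reaches goal = 5
path = [0, 1, 0, 1, 0, 1, 0]
steps = list(zip(path, path[1:]))
naive_total = sum(naive_shaped_reward(a, b, goal=5) for a, b in steps)
signed_total = sum(signed_shaped_reward(a, b, goal=5) for a, b in steps)
print(naive_total)   # approximately 0.3, and it keeps growing with more oscillation
print(signed_total)  # 0.0
```

Under the naive reward, oscillating forever scores better than a slow but honest path, so the agent maximises the proxy rather than the goal. Checking the total reward of a deliberately useless trajectory like this one is a cheap test for such loopholes.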
Common environments
Practice grounds:
- Gymnasium (formerly OpenAI Gym) — CartPole, MountainCar, Lunar Lander, Atari.
- PettingZoo — multi-agent.
- MuJoCo / Brax — physics simulators for continuous control.
- RoboMaker + Gazebo — robotics-grade simulation. Used in AWS DeepRacer.
Libraries:
- Stable-Baselines3 — production-ready implementations of PPO, A2C, SAC, DQN.
- RLlib (Ray) — distributed RL.