Maria Aguilera

┌─────────┐         action a_t          ┌─────────────┐
│  AGENT  │ ─────────────────────────►  │ ENVIRONMENT │
└─────────┘                              └─────────────┘
     ▲                                          │
     │ state s_{t+1}, reward r_{t+1}            │
     └──────────────────────────────────────────┘

At each step:

Agent observes state s.
Picks action a from a policy π(a | s).
Environment returns next state s' and reward r.
Goal: maximise the cumulative discounted reward (the return): G = Σ_{t≥0} γ^t · r_{t+1}, where the discount factor γ ∈ [0, 1] sets how much future rewards count.
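The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CoinFlipEnv`, `run_episode`, and `discounted_return` are hypothetical names invented here, and the `reset`/`step` interface is only loosely modelled on common RL environment APIs.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is just the step index

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where rewards[k] follows the k-th action."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def run_episode(env, policy, gamma=0.9):
    state = env.reset()
    rewards = []
    done = False
    while not done:
        action = policy(state)                  # agent picks a from its policy
        state, reward, done = env.step(action)  # environment returns s' and r
        rewards.append(reward)
    return discounted_return(rewards, gamma)


# Usage: a trivial policy that always guesses 0.
episode_return = run_episode(CoinFlipEnv(), policy=lambda s: 0)
```

With γ close to 0 only the first few rewards matter; with γ close to 1 the agent weighs distant rewards almost as much as immediate ones.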

	Supervised	Reinforcement
Signal	Correct label per sample	Scalar reward, often delayed
Data	Static dataset	Generated by interaction
Goal	Predict	Decide / control
Trade-off	Bias vs variance	Exploration vs exploitation
Failure mode	Overfit	Bad reward → bad policy

In supervised learning, the dataset is given. In RL, the dataset is created by the policy itself, and bad policies generate bad data. That feedback loop is most of what makes RL hard.

Exploration vs exploitation is the defining tension of RL.

Exploit: pick the action your current best estimate says is best.
Explore: try other actions to learn more.

Pure exploit → stuck in local optimum. Pure explore → never converge.

Standard approaches:

ε-greedy: pick the best action with probability 1−ε, a random one otherwise.
Boltzmann (softmax) exploration: sample actions with probability proportional to exp(Q(s, a) / τ), where the temperature τ controls how random the choice is.
Optimism in the face of uncertainty (UCB): prefer actions with high upper-confidence bounds on their value.
Entropy regularisation in policy methods: penalise overly deterministic policies.
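As a rough sketch, the first three strategies can be written as action-selection functions over a list of estimated action values. The function names, the `temperature` parameter, and the UCB1-style bonus with constant `c` are assumptions made for illustration; other variants exist.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution: P(a) is proportional to exp(Q(a) / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample an action from the Boltzmann distribution."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def ucb1(q_values, counts, t, c=2.0):
    """UCB1-style rule: value estimate plus a bonus that shrinks as an action is tried more."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every action once before trusting the estimates
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + math.sqrt(c * math.log(t) / counts[a]),
    )
```

Setting `epsilon=0` or a very low `temperature` recovers pure exploitation; a high `epsilon` or `temperature` approaches uniform random exploration.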

Value-based methods learn a value function that estimates how good each state (or state-action pair) is.

V(s) — expected return from state s.
Q(s, a) — expected return from taking action a in state s, then following the policy.

Policy is derived: π(s) = argmax_a Q(s, a).

Key algorithms:

Q-Learning: off-policy, updates Q toward r + γ · max_a' Q(s', a').
SARSA: on-policy, updates Q toward r + γ · Q(s', a'), where a' is the action actually taken next.
DQN: deep Q-network. Replaces the Q-table with a neural net. Atari-era breakthrough.
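The difference between the Q-Learning and SARSA targets is easiest to see side by side. The sketch below assumes a tabular Q stored in a dictionary keyed by (state, action); the function names, the learning rate `alpha`, and the `done` flag are illustrative choices, not taken from any particular library.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Off-policy: bootstrap from the best next action, whatever the agent does next."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy: bootstrap from the action the agent actually takes next."""
    next_value = 0.0 if done else Q[(s_next, a_next)]
    target = r + gamma * next_value
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Derived policy: pick argmax over a of Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])


# Usage: one update from the same transition under each rule.
Q = defaultdict(float)
Q[("s1", 0)], Q[("s1", 1)] = 1.0, 5.0
q_learning_update(Q, "s0", 0, r=1.0, s_next="s1", actions=[0, 1], alpha=0.5, gamma=0.9)
sarsa_update(Q, "s0", 1, r=1.0, s_next="s1", a_next=0, alpha=0.5, gamma=0.9)
```

In the usage lines, Q-Learning bootstraps from the higher-valued next action (value 5.0), while SARSA bootstraps from the action that was actually chosen (value 1.0), so the two updates move Q by different amounts.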

Policy-based methods directly parameterise the policy π_θ(a | s) and optimise θ to maximise expected return.

No value function needed (in pure form).
Naturally handles continuous action spaces.
Stochastic by default, which can help under partial observability.

Key algorithms:

REINFORCE: vanilla policy gradient. High variance.
A2C / A3C: actor-critic, adds a value baseline to reduce variance.
PPO: clips the probability ratio between the new and old policy so each update stays small. The modern default. Stable and relatively easy to tune.
SAC: off-policy with entropy regularisation. Sample-efficient.
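To make the PPO clipping concrete, here is a minimal per-sample sketch of the clipped surrogate objective. The function name and the default `clip_eps=0.2` are assumptions for illustration; real implementations average this over a batch and add value and entropy terms.

```python
import math


def ppo_clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    # Probability ratio between the new and old policy for the sampled action.
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

When the advantage is positive, the objective stops growing once the ratio exceeds 1 + `clip_eps`; when it is negative, it stops improving once the ratio drops below 1 − `clip_eps`. Either way, a single batch cannot move the policy arbitrarily far.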

Actor-critic, the dominant modern architecture, combines value-based and policy-based methods:

Actor — the policy network. Decides actions.
Critic — the value network. Estimates how good the state is.

The critic's estimate is used as a baseline to reduce variance in the actor's gradient updates.

PPO, A2C, and SAC are all actor-critic under the hood.
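One common way to use the critic as a baseline is the one-step TD advantage, A = r + γ · V(s') − V(s). The sketch below assumes that choice; the function names are hypothetical, and the losses are shown for a single transition with plain floats rather than a neural network.

```python
def td_advantage(reward, value, next_value, gamma=0.99, done=False):
    """Return (advantage, target) for one transition using a one-step TD target."""
    target = reward + (0.0 if done else gamma * next_value)
    return target - value, target


def actor_critic_losses(logp_action, reward, value, next_value, gamma=0.99, done=False):
    """Per-transition losses: the actor follows the advantage, the critic regresses to the target."""
    advantage, target = td_advantage(reward, value, next_value, gamma, done)
    actor_loss = -logp_action * advantage  # advantage is treated as a constant here
    critic_loss = (target - value) ** 2
    return actor_loss, critic_loss
```

A positive advantage means the action did better than the critic expected, so minimising the actor loss raises its log-probability; a negative advantage lowers it.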

The reward function defines the problem. Designing it is most of the work.

Sparse rewards (only at the goal) → learning is slow or stalls, because the credit-assignment chain is long.
Dense rewards (every step) → fast learning, but you can shape behaviour you didn't intend.
Shaped rewards can produce reward hacking: the agent finds a loophole that earns reward without achieving the goal.

Rules of thumb:

Start sparse. Add shaping terms only when nothing learns.
Test whether the agent maximises your goal or your reward proxy. They're not always the same.
See real examples: DeepRacer cheat sheet, Lunar Lander cheat sheet.
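A toy illustration of the goal-versus-proxy check: the sketch below scores two fixed trajectories in a hypothetical 1-D corridor under a sparse reward and under a naively shaped one. The corridor, the `GOAL` position, the 0.2 bonus, and the absence of discounting are all assumptions chosen to make the loophole visible.

```python
GOAL = 5  # hypothetical goal position in a 1-D corridor


def sparse_reward(pos, next_pos):
    """Reward only on reaching the goal."""
    return 1.0 if next_pos == GOAL else 0.0


def shaped_reward(pos, next_pos):
    """Sparse reward plus a naive bonus for any step that moves closer to the goal."""
    bonus = 0.2 if abs(GOAL - next_pos) < abs(GOAL - pos) else 0.0
    return sparse_reward(pos, next_pos) + bonus


def total_reward(reward_fn, positions):
    """Undiscounted sum of rewards along a sequence of positions."""
    return sum(reward_fn(p, q) for p, q in zip(positions, positions[1:]))


direct = [0, 1, 2, 3, 4, 5]  # walks straight to the goal
loop = [0, 1] * 12           # steps forward and back, never reaching the goal

for name, fn in [("sparse", sparse_reward), ("shaped", shaped_reward)]:
    print(name, total_reward(fn, direct), total_reward(fn, loop))
```

Under the sparse reward the direct path scores higher. Under the shaped reward the back-and-forth loop scores higher, because it keeps collecting the progress bonus without ever finishing: the proxy and the goal have come apart.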

Practice grounds:

Gymnasium (formerly OpenAI Gym) — CartPole, MountainCar, Lunar Lander, Atari.
PettingZoo — multi-agent.
MuJoCo / Brax — physics simulators for continuous control.
RoboMaker + Gazebo — robotics-grade simulation. Used in AWS DeepRacer.

Libraries:

Stable-Baselines3: reliable PyTorch implementations of PPO, A2C, SAC, DQN.
RLlib (Ray): distributed RL.

From Prediction to Decision — RL Cheat Sheet

The setup

Supervised vs RL

Exploration vs exploitation

Value-based methods

Policy-based methods

Actor-Critic

Reward shaping

Common environments