
Training AWS DeepRacer — Cheat Sheet

A 1/18-scale autonomous car, a reward function, and the gap between the simulator and a real track. Covers the SageMaker + RoboMaker stack, PPO vs SAC, and sim-to-real strategies.

Updated October 2023
1

The architecture

SageMaker (RL training job)
        │ policy update
RoboMaker (simulation)
   ├─ Gazebo (physics)
   └─ ROS (state messages)

Redis (state + reward storage)

  • SageMaker holds the RL algorithm (PPO or SAC) + neural-net policy.
  • RoboMaker runs the simulated track in Gazebo with the ROS robotics framework.
  • Redis mediates state and reward signals between them.
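
The data flow above can be sketched as a toy loop. This is an illustration only, not DeepRacer's code: a `collections.deque` stands in for Redis, `simulate_step` stands in for the RoboMaker/Gazebo simulation, and `train_on_batch` stands in for the SageMaker training job. All three names, and the one-weight "policy" with its update rule, are hypothetical and far simpler than PPO or SAC.

```python
import random
from collections import deque

# Stand-in for Redis: a bounded in-memory buffer of experience tuples.
experience_buffer = deque(maxlen=10_000)

def simulate_step(policy_weight, rng):
    """Hypothetical stand-in for one simulation step."""
    state = rng.uniform(-1.0, 1.0)       # signed offset from the centreline
    action = -policy_weight * state      # steer back towards the centre
    reward = 1.0 - abs(state + action)   # higher when the car ends up centred
    return state, action, reward

def train_on_batch(policy_weight, batch, lr=0.1):
    """Hypothetical stand-in for a policy update (toy rule, not real RL)."""
    mean_reward = sum(r for _, _, r in batch) / len(batch)
    # Nudge the weight towards 1.0, more strongly when the mean reward is low.
    return policy_weight + lr * (1.0 - mean_reward) * (1.0 - policy_weight)

def run(iterations=20, steps_per_iteration=50, seed=0):
    rng = random.Random(seed)
    policy_weight = 0.0
    for _ in range(iterations):
        # Simulation side: roll out the current policy and push experience.
        for _ in range(steps_per_iteration):
            experience_buffer.append(simulate_step(policy_weight, rng))
        # Training side: pull the batch back out and update the policy.
        batch = [experience_buffer.pop() for _ in range(steps_per_iteration)]
        policy_weight = train_on_batch(policy_weight, batch)
    return policy_weight
```

The point of the sketch is the shape of the loop: the simulator only writes experience, the trainer only reads it, and the updated policy is what flows back to the simulator.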
2

The reward function

The single Python function that decides whether the car learns to drive or learns to spin.

Parameters available at every step:

Parameter              Meaning
progress               % of track completed (0–100).
speed                  Current car speed.
steering_angle         −30° to +30°.
all_wheels_on_track    Boolean.
distance_from_center   Distance from track centreline.
closest_waypoints      Indices of nearest waypoints.
is_offtrack            Boolean — terminal failure.

Reward functions are shaping problems — too sparse and nothing learns, too dense and you overfit to the simulator.
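
A minimal sketch of a dense-but-simple reward, assuming the `reward_function(params)` entry point that DeepRacer calls at every step. It uses the parameters listed above plus `track_width`, which is not in the table but is, as far as I know, another key in the same `params` dictionary. The 0.1 speed weight and the 1e-3 failure reward are arbitrary choices for illustration, not recommended values.

```python
def reward_function(params):
    """Reward staying on track and close to the centreline, with a small speed bonus."""
    # Terminal failure or wheels off the track: return a near-zero reward.
    if params["is_offtrack"] or not params["all_wheels_on_track"]:
        return 1e-3

    # Normalise the centreline distance by half the track width
    # (0 = on the centreline, 1 = at the edge). track_width is an assumed key.
    half_width = 0.5 * params["track_width"]
    offset = min(params["distance_from_center"] / half_width, 1.0)

    # Dense term: falls off linearly towards the edge of the track.
    reward = 1.0 - offset

    # Small speed bonus so the car does not learn to crawl (weight is arbitrary).
    reward += 0.1 * params["speed"]

    return float(reward)
```

Calling it with a hand-built dictionary is a cheap way to sanity-check the shape of the reward before spending training hours on it.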

3

PPO vs SAC

                    PPO                      SAC
Type                On-policy                Off-policy
Action space        Discrete or continuous   Continuous
Exploration         Stochastic policy        Entropy-regularised
Sample efficiency   Lower                    Higher
Stability           More stable              Sometimes brittle
DeepRacer           Default                  Option

Use PPO for first runs — it converges reliably with discrete action spaces. Use SAC when you need smoother continuous steering / throttle and have the compute budget.
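
The difference between the two action-space types can be sketched as follows. The helper names, the speed range, and the dictionary layout are assumptions for illustration; only the ±30° steering range comes from the parameter table above.

```python
import itertools
import random

def discrete_action_space(steering_angles, speeds):
    """Every (steering, speed) pair becomes one indexed action."""
    pairs = itertools.product(steering_angles, speeds)
    return [
        {"index": i, "steering_angle": angle, "speed": speed}
        for i, (angle, speed) in enumerate(pairs)
    ]

def sample_continuous_action(rng, steering_range=(-30.0, 30.0), speed_range=(0.5, 3.0)):
    """A continuous policy can output any real value inside the ranges."""
    return {
        "steering_angle": rng.uniform(*steering_range),
        "speed": rng.uniform(*speed_range),
    }

actions = discrete_action_space([-30, -15, 0, 15, 30], [1.0, 2.0])
print(len(actions))  # 10 actions: 5 steering angles x 2 speeds
```

With a discrete space the policy only has to pick one of a handful of indices, which is part of why first runs converge more easily. A continuous space removes the jump between neighbouring actions at the cost of a harder learning problem.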

4

Hyperparameters that matter

  • Learning rate — too high and the policy oscillates, too low and it stalls. 3e-4 is a sane default.
  • Batch size — bigger = smoother gradient, slower per step.
  • Discount factor γ — closer to 1 = longer-term thinking. 0.999 for racing, since reward is sparse until lap end.
  • Entropy coefficient — knob for exploration vs exploitation.
  • Number of epochs per update — too many → over-optimised against current rollouts → instability.
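
The discount-factor bullet can be made concrete with the common rule of thumb that a policy effectively plans about 1 / (1 − γ) steps ahead. The sketch below computes that horizon and a discounted return; the function names are mine, and the rule is an approximation rather than a hard cutoff.

```python
def effective_horizon(gamma):
    """Rule-of-thumb planning horizon, in steps, for discount factor gamma."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, accumulated back to front."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

for gamma in (0.9, 0.99, 0.999):
    # Prints horizons of 10, 100 and 1000 steps respectively.
    print(gamma, round(effective_horizon(gamma)))
```

Put differently, a reward 100 steps in the future is weighted by about 0.00003 at γ = 0.9 but by about 0.90 at γ = 0.999, which is why a high γ suits rewards that only arrive late in the lap.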
5

Sim-to-real transfer

The hardest part. The simulator is clean; the real track is messy. Three strategies:

  1. Domain randomisation — train across many simulated track conditions (lighting, friction, slight track shifts). The policy learns the invariants.
  2. Robust reward shaping — penalise behaviours that are fragile in physical space (sharp turns at high speed, hugging track edges).
  3. Conservative deployment — cap the deployed model's maximum speed below the simulator best, since the real car is expected to underperform.
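
The three strategies above can each be sketched in a few lines. Everything here is hypothetical: the function names, the parameter names, the ranges, the 0.5 penalty, and the 0.8 safety factor are placeholders for illustration, and whether your simulator exposes lighting or friction settings at all is outside this sketch.

```python
import random

def sample_sim_conditions(rng):
    """Strategy 1: draw a fresh set of simulator conditions for each episode."""
    return {
        "light_intensity": rng.uniform(0.5, 1.5),   # relative brightness
        "friction": rng.uniform(0.7, 1.1),          # relative tyre grip
        "track_shift_m": rng.uniform(-0.05, 0.05),  # lateral track offset, metres
    }

def fragility_penalty(speed, steering_angle, speed_limit=2.0, angle_limit=15.0):
    """Strategy 2: reward multiplier below 1 for sharp turns taken at high speed."""
    if speed > speed_limit and abs(steering_angle) > angle_limit:
        return 0.5
    return 1.0

def cap_speed(action, sim_best_speed, safety_factor=0.8):
    """Strategy 3: clamp the commanded speed below the simulator best."""
    capped = dict(action)  # copy so the original action is left untouched
    capped["speed"] = min(action["speed"], safety_factor * sim_best_speed)
    return capped
```

In use, `sample_sim_conditions` would be called once per training episode, `fragility_penalty` would multiply the reward inside the reward function, and `cap_speed` would wrap the policy's output only at deployment time.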

The car always drives worse on the real track. The question is by how much, and whether it finishes the lap.

6

What I learned

  • Reward shaping is 80 % of the project. The same RL algorithm with two different rewards produces wildly different drivers.
  • PPO is the right default. Don't reach for SAC unless you've exhausted PPO tuning.
  • The real car underperforms the simulator by 10–30 %. Plan for it.
  • The simulator's pretty visuals are the trap. Optimising for the sim's leaderboard ≠ optimising for the real track.