1 of 28

Lab 5: How to Train Your Dog!

CS 123

modified from Jaden’s original slides

2 of 28

Let’s make Pupper walk!

Check out Jaden’s work: https://lgpl-gaits.github.io/

3 of 28

Sim2Real Locomotion Policy Learning

4 of 28

Lab Overview

  1. Train your first policy using naive velocity tracking
  2. Effort conservation
  3. General reward tuning
    1. what auxiliary rewards are important?
  4. Deploy on Pupper
  5. Domain randomization
    • Make Pupper walk better in the real world!
  6. Ablation study & method report
  7. Additional optional lab released by end of this week (attempt at your own risk)

Although this lab requires little to no coding, it is complicated in practice. Make sure to start early!

5 of 28

Google Colab Setup

  1. Copy the Colab
    1. Purchase Colab Pro
    2. Save receipt
  2. Set up a wandb account
  3. Run the installs

(check lab document for concrete instructions)

6 of 28

Training Pupper to Walk with MuJoCo XLA (MJX)

What is MJX?

  • A JAX-based reimplementation of the MuJoCo (Multi-Joint dynamics with Contact) physics simulator.

  • Allows fast, differentiable, and hardware-accelerated physics simulation.

  • Supports parallel environments on CPU/GPU, enabling large-scale policy training.

  • Fully compatible with JAX's functional programming and autodiff ecosystem.

  • Cons: standard debugging tools such as breakpoint() and regular print() do not behave as expected once everything is traced (not an issue for this lab)

7 of 28

How We Train Pupper

  • Simulated in MJX: A detailed Pupper model with accurate mass, motor, and contact properties.

  • Reward Shaping: Encourages forward velocity, balance, and energy efficiency. (You will implement these)

  • Policy Training: Train walking policies using Proximal Policy Optimization (PPO), running thousands of parallel simulated environments simultaneously on the GPU

  • Sim2Real Transfer: After training in simulation, deploy learned policies on real Pupper hardware.
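
For reference, a minimal sketch of the clipped surrogate objective that gives PPO its name. The lab's training code handles this internally; the function name and the clip_eps default of 0.2 here are illustrative assumptions, not values taken from the lab.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (to be maximized).

    ratios: new_policy_prob / old_policy_prob for each sampled action.
    advantages: advantage estimate for each sampled action.
    clip_eps: clipping range (0.2 is an assumed, commonly used default).
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Keep the probability ratio inside [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)
```

The clipping removes the incentive to move the new policy far from the one that collected the data, which keeps each update step small.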

8 of 28

Rewards in RL

Goal: optimize the reward

reward

state

action

9 of 28

What’s a good reward…

Well… the reward is a function of the state and action at time t

state/observation: 12-DOF motor positions and velocities, roll/pitch/yaw, base velocity and orientation, foot contact forces and positions, etc.

action: normalized 12-DOF motor position commands
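
One way to read "normalized" motor position commands is as offsets around a nominal standing pose. The sketch below assumes that mapping; default_pose, action_scale, and the clipping to [-1, 1] are hypothetical choices for illustration and may differ from what the lab environment does.

```python
def action_to_motor_targets(action, default_pose, action_scale=0.3):
    """Map a normalized action to joint position targets in radians.

    action: 12 policy outputs, nominally in [-1, 1].
    default_pose: 12 nominal joint angles in radians (hypothetical).
    action_scale: radians of offset per unit of action (hypothetical).
    """
    if len(action) != len(default_pose):
        raise ValueError("action and default_pose must have the same length")
    targets = []
    for a, q0 in zip(action, default_pose):
        a = max(-1.0, min(1.0, a))  # clip to the normalized range
        targets.append(q0 + action_scale * a)
    return targets
```

Under this assumed mapping, an all-zero action holds the nominal pose, so an untrained policy with small outputs starts near a standing configuration.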

12 of 28

We want Pupper to follow a velocity command

Is a velocity-tracking reward alone sufficient?

In practice… NO

Why???

13 of 28

Teaching Pupper to walk is like teaching a toddler

14 of 28

What happens when you only reward velocity-command tracking…

Command: 0.05 m/s
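
A minimal sketch of a velocity-tracking reward, assuming the exponential kernel commonly used in legged-locomotion work. The function name and the sigma value are assumptions; the rewards.py used in the lab may define this term differently.

```python
import math


def velocity_tracking_reward(commanded_vx, actual_vx, sigma=0.25):
    """Reward in (0, 1] that peaks when actual velocity equals the command.

    sigma controls how quickly the reward decays with squared error
    (0.25 is an assumed value, not taken from the lab).
    """
    error = commanded_vx - actual_vx
    return math.exp(-(error ** 2) / sigma)


# Compare standing still against perfect tracking for a 0.05 m/s command.
standing = velocity_tracking_reward(0.05, 0.0)
perfect = velocity_tracking_reward(0.05, 0.05)
```

Under these assumed values, standing still at a 0.05 m/s command scores about 0.99 of the maximum, so the tracking term by itself barely separates walking from not walking, and it says nothing about how the robot moves.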

15 of 28

What happens when you only reward velocity-command tracking…

17 of 28

Teaching Pupper to walk is like teaching a toddler

Can we just teach a baby to walk by giving candy when it goes forward??

Need to give it auxiliary tasks…

  • first standing up (survival)
  • walking while holding onto a human to learn the correct gait (reference motion; will be used in optional lab 2)
  • etc.

18 of 28

Teaching Pupper to walk is like teaching a toddler

Shooting a basketball

  • Learn the correct form

19 of 28

Auxiliary rewards

Guiding the gradient toward the correct optimum

20 of 28

Pupper needs auxiliary rewards too

How to encourage Pupper to walk with the correct gait?

Linear combination of differentiable rewards:

  • Efficiency
    • Torque penalties
    • DOF acceleration penalties
  • Stability
    • orientation penalty
  • Unnecessary contacts
    • knees / body hitting the ground
  • Robustness to different terrain
    • foot clearance
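
The linear combination above can be sketched as a weighted sum over named reward terms. The term names, weights, and numeric values below are hypothetical placeholders for illustration; the actual definitions are in the rewards.py linked in these slides.

```python
def torque_penalty(torques):
    """Sum of squared joint torques; larger means more effort spent."""
    return sum(t * t for t in torques)


def total_reward(terms, weights):
    """Weighted sum of reward terms; penalties carry negative weights."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical weights: one positive tracking term, negative penalty terms.
weights = {
    "tracking_lin_vel": 1.5,
    "torques": -0.001,
    "orientation": -5.0,
}

# Hypothetical per-step values of each term.
terms = {
    "tracking_lin_vel": 0.9,
    "torques": torque_penalty([1.0, -1.0, 1.0, 1.0]),
    "orientation": 0.01,
}

reward = total_reward(terms, weights)
```

Because the terms live on very different numeric scales, the weights set how much each one matters relative to velocity tracking, which is why tuning them takes up most of the lab.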

reward definitions: https://cs123-stanford.readthedocs.io/en/latest/_static/rewards.py

22 of 28

RL Workflow

  • Update reward weights

  • Train policy

  • Visualize policy + review curves in wandb

23 of 28

Domain randomization

System identification is never perfect…

Which terms to randomize?

Units are in meters.
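
As a rough illustration of domain randomization, the sketch below draws a fresh set of physical parameters for each training episode. Every parameter name and range here is a hypothetical example, not the lab's configuration; deciding which terms to randomize, and by how much, is the exercise.

```python
import random

# Hypothetical (low, high) ranges; lengths are in meters, masses in kg.
RANDOMIZATION_RANGES = {
    "floor_friction": (0.6, 1.4),
    "body_mass_offset_kg": (-0.1, 0.3),
    "motor_kp_scale": (0.8, 1.2),
    "com_offset_m": (-0.01, 0.01),
}


def sample_episode_params(rng, ranges=RANDOMIZATION_RANGES):
    """Draw one uniformly random value per parameter for a new episode."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)  # fixed seed so a run can be reproduced
params = sample_episode_params(rng)
```

Training across many such samples pushes the policy to work over a range of plausible robots rather than the single simulated model, which is what helps when system identification is imperfect.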

24 of 28

Policy deployment

Run on Pupper: python3 deploy.py

Press different buttons on the controller to try out different walking gaits!

25 of 28

RL Workflow

  • Improve policy in sim
  • Update reward weights
  • Train policy
  • Visualize policy + review curves in wandb
  • Domain randomization
  • Deploy on Pupper

26 of 28

Challenge: Train the most agile policy (teaser for optional lab 2)

  1. Check out the linked articles for inspiration on good rewards
  2. Experiment with heightfield configs

Optional Lab Release:

  • The optional lab will be released this weekend.
  • It involves significantly more coding, experimentation and tuning than Optional Lab 1
  • Completing it will qualify your group for a very strong final project opportunity!

27 of 28

Quick Tips

Safety

    • Do not deploy policies on the table
    • Do not deploy policies that are highly unstable in sim
    • Be prepared to cancel potentially harmful policies (keep Ctrl+C ready at all times!)

General

    • Everyone should tune their own policies, and get practice deploying on the Pupper
    • Work together to decide which rewards to try out

28 of 28

General safety

  • Do not try to jam the HDMI cables into the Pupper. A lot of them have been breaking (back in fall 2024, including my own Pupper 🥲). The cable should fit easily into the port without force.
  • Make sure Pupper is powered off when modifying the hardware
    • battery is out
    • any time you are adjusting screws or manipulating hardware in any way
  • Don’t lose the batteries/battery chargers! These are expensive!
