1 of 28

Lab 5: How to Train Your Dog!

CS 123

modified from Jaden’s original slides

2 of 28

Let’s make Pupper walk!

Check out Jaden’s work: https://lgpl-gaits.github.io/

3 of 28

Sim2Real Locomotion Policy Learning

4 of 28

Lab Overview

Train your first policy using naive velocity tracking
Effort conservation
General reward tuning

Which auxiliary rewards are important?

Deploy on Pupper
Domain randomization

Make Pupper walk better in the real world!

Ablation study & method report
Additional optional lab released by end of this week (attempt at your own risk)

Although little (if any) coding is needed, this is in practice a very complicated lab. Make sure to start early!

5 of 28

Google Colab Setup

Copy the Colab

Purchase Colab Pro
Save receipt

Set up a wandb account
Run the installs

(check lab document for concrete instructions)

6 of 28

Training Pupper to Walk with MuJoCo XLA (MJX)

What is MJX?

A JAX-based reimplementation of the MuJoCo (Multi-Joint dynamics with Contact) physics simulator.

Allows fast, differentiable, hardware-accelerated physics simulation.

Supports parallel environments on CPU/GPU, enabling large-scale policy training.

Fully compatible with JAX's functional programming and autodiff ecosystem.

Cons: standard debugging tools such as breakpoint() and regular print() do not work as usual once everything is traced (not an issue for this lab)

7 of 28

How We Train Pupper

Simulated in MJX: A detailed Pupper model with accurate mass, motor, and contact properties.

Reward Shaping: Encourages forward velocity, balance, and energy efficiency. (You will implement these)

Policy Training: Train walking policies using Proximal Policy Optimization (PPO), running thousands of parallel environments simultaneously on the GPU.

Sim2Real Transfer: After training in simulation, deploy learned policies on real Pupper hardware.
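
For reference, the core of PPO is a clipped surrogate objective. The sketch below is a plain-Python illustration of that objective for a small batch. It is not the lab's training code, and the function name and the clip_eps default of 0.2 are assumptions chosen for illustration.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    ratios: new_policy_prob / old_policy_prob for each sampled action.
    advantages: advantage estimate for each sampled action.
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Clip the probability ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


# A ratio of 1.5 with a positive advantage is capped at 1.2 * advantage.
capped = ppo_clip_objective([1.5], [1.0])
```

The clipping removes the incentive to move the policy far from the one that collected the data in a single update.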

8 of 28

Rewards in RL

Goal: optimize the reward

reward

state

action

9 of 28

What’s a good reward…

Well… the reward is a function of the state and action at time t

state/observation: 12-dof motor positions + velocities, roll/pitch/yaw, base velocities / orientation, foot contact forces/position, etc

action: normalized 12-dof motor position commands

12 of 28

We want Pupper to follow a velocity command

Is a velocity-following reward alone sufficient?

In practice… NO

Why???

13 of 28

Teaching Pupper to walk is like teaching a toddler

14 of 28

What happens when you use only the velocity command reward…

Command: 0.05 m/s
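
A naive velocity-tracking reward can be sketched as below. This is one common form (an exponential of the squared velocity error), written under assumptions: the function name and the sigma value are hypothetical, and the lab's rewards.py may define the term differently.

```python
import math


def velocity_tracking_reward(commanded_vel, actual_vel, sigma=0.25):
    """Reward in (0, 1] that peaks when the base velocity matches the command.

    commanded_vel, actual_vel: (vx, vy) in m/s.
    sigma: hypothetical width parameter controlling how fast the reward decays.
    """
    squared_error = sum((c - a) ** 2 for c, a in zip(commanded_vel, actual_vel))
    return math.exp(-squared_error / sigma)


command = (0.05, 0.0)                                       # 0.05 m/s forward
perfect = velocity_tracking_reward(command, (0.05, 0.0))    # exactly 1.0
standing = velocity_tracking_reward(command, (0.0, 0.0))    # roughly 0.99
```

Under this assumed form, a robot that does not move at all still collects about 99% of the maximum reward for a 0.05 m/s command, so the tracking term by itself says little about how the robot should move.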

17 of 28

Teaching Pupper to walk is like teaching a toddler

Can we just teach a baby to walk by giving candy when it goes forward??

Need to give it auxiliary tasks…

first standing up (survival)
walking while holding a human's hand to learn the correct gait (reference motion; will be used in optional lab 2)
etc.

18 of 28

Teaching Pupper to walk is like teaching a toddler

Shooting a basketball

Learn the correct form

19 of 28

Auxiliary rewards

Guide the gradient toward the correct optimum

20 of 28

Pupper needs auxiliary rewards too

How to encourage Pupper to walk with the correct gait?

Linear combination of differentiable rewards:

Efficiency

Torque penalties
DOF acceleration penalties

Stability

orientation penalty

Unnecessary contacts

knees / body hitting the ground

Robustness to different terrain

foot clearance
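
The linear combination above can be sketched as a weighted sum of named terms. Everything here is illustrative: the function names, the weight values, and the example numbers are hypothetical and are not taken from the lab's rewards.py.

```python
def torque_penalty(torques):
    """Sum of squared joint torques (efficiency term)."""
    return sum(t * t for t in torques)


def orientation_penalty(roll, pitch):
    """Squared roll and pitch angles in radians (stability term)."""
    return roll ** 2 + pitch ** 2


def total_reward(terms, weights):
    """Linear combination: sum of weight * value over every named term."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical weights: positive for the task reward, negative for penalties.
weights = {"tracking": 1.5, "torque": -0.0002, "orientation": -5.0}

terms = {
    "tracking": 0.9,                                   # e.g. a velocity-tracking value
    "torque": torque_penalty([0.5] * 12),              # 12 motors
    "orientation": orientation_penalty(0.05, -0.02),
}
reward = total_reward(terms, weights)
```

Tuning the reward then amounts to changing the entries of the weights dictionary and retraining.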

reward definitions: https://cs123-stanford.readthedocs.io/en/latest/_static/rewards.py

22 of 28

RL Workflow

Update reward weights

Train policy

Visualize policy + review curves in wandb

23 of 28

Domain randomization

System identification is never perfect…

Which terms to randomize?

Units are in meters
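
One way to picture domain randomization is to draw a fresh set of physical parameters for every training episode. The sketch below assumes uniform sampling; the parameter names and ranges are hypothetical placeholders, not the lab's actual configuration.

```python
import random

# Hypothetical (low, high) ranges; choose real ones from the lab's config.
RANDOMIZATION_RANGES = {
    "friction": (0.6, 1.4),
    "body_mass_scale": (0.9, 1.1),
    "kp_scale": (0.8, 1.2),
    "com_offset_x_m": (-0.01, 0.01),   # meters
}


def sample_episode_params(ranges, rng):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)   # seeded so a run can be reproduced
params = sample_episode_params(RANDOMIZATION_RANGES, rng)
```

A policy trained across many such draws has to work for the whole range, which is the point when the real robot's parameters are only approximately known.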

24 of 28

Policy deployment

Run on Pupper: python3 deploy.py

Press different buttons on the controller to try out different walking gaits!

25 of 28

RL Workflow

Improve policy in sim
Update reward weights
Train policy
Visualize policy + review curves in wandb
Domain randomization
Deploy on Pupper

26 of 28

Challenge: Train the most agile policy (teaser for optional lab 2)

Check out the linked articles for inspiration on good rewards
Experiment with heightfield configs

Optional Lab Release:

The optional lab will be released this weekend.
It involves significantly more coding, experimentation and tuning than Optional Lab 1
Completing it will qualify your group for a very strong final project opportunity!

27 of 28

Quick Tips

Safety

Do not deploy policies on the table
Do not deploy policies that are highly unstable in sim
Be prepared to cancel potentially harmful policies (keep Ctrl+C ready at all times!)
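
In Python, Ctrl+C raises KeyboardInterrupt in the main thread, so a control loop can be written to always run a stop routine on the way out. This is only a sketch of that pattern: run_policy, step_fn, and stop_fn are hypothetical names, and how deploy.py actually handles interruption is not shown here.

```python
def run_policy(step_fn, stop_fn, max_steps):
    """Call step_fn repeatedly; always call stop_fn on exit, including on Ctrl+C.

    Returns the number of steps that completed.
    """
    steps_done = 0
    try:
        for _ in range(max_steps):
            step_fn()          # e.g. read sensors, run the policy, send commands
            steps_done += 1
    except KeyboardInterrupt:
        # Ctrl+C lands here; fall through to the stop routine.
        pass
    finally:
        stop_fn()              # e.g. command the motors to a safe state
    return steps_done
```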

General

Everyone should tune their own policies, and get practice deploying on the Pupper
Work together to decide which rewards to try out

28 of 28

General safety

Do not try to jam the HDMI cables into the Pupper: a lot have been breaking (back in fall 2024, including my own Pupper 🥲). The cable should fit easily into the port without force
Make sure Pupper is powered off when modifying the hardware

The battery must be out
This applies any time you are adjusting screws or manipulating hardware in any way

Don’t lose the batteries/battery chargers! These are expensive!