1 of 10

Humanoid Robot

Group 19

Ruthwik Dasyam

Zahiruddin Mahammad

2 of 10

1. Gymnasium + MuJoCo

Gymnasium - a simple, Pythonic interface for RL problems; a maintained fork of OpenAI Gym

MuJoCo - a physics engine for model-based optimization; simulates and visualizes the environment

3D Bipedal Robot - Torso + head + 2 legs + 2 arms

Algorithm : PPO (Proximal Policy Optimization)

An on-policy algorithm, can be used for environments with either discrete or continuous action spaces.

Learning Rate : 0.0003

Gamma : 0.99

OpenAI Gymnasium “Humanoid-v4”

Dimension of Action space - 17

Observation Space - 378

Body parts - 14, Joints - 23

Reward :

    • A bonus for every time step the humanoid stays alive (healthy)
    • A reward for walking forward
    • Penalties for high contact forces and large control (actuator) forces
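As a rough sketch of how these terms combine per time step (based on the Gymnasium Humanoid reward description; the individual weights are configurable defaults):

    reward = healthy_reward + forward_reward - ctrl_cost - contact_cost

    • healthy_reward - constant bonus while the torso stays within a healthy height range
    • forward_reward - proportional to the torso's forward (x) velocity
    • ctrl_cost - penalty on the squared actuator forces
    • contact_cost - penalty on the squared external contact forces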

Algorithms Trained : SAC, PPO

SAC - Increasing Reward

PPO - Increasing Reward

Algorithm : SAC (Soft Actor-Critic)

An off-policy algorithm, can be used for environments with continuous action spaces.

Learning Rate : 0.0003

Gamma : 0.99
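A minimal training sketch with Stable Baselines3 using the hyperparameters above; the training budget here is illustrative, not the exact value used in our runs.

    import gymnasium as gym
    from stable_baselines3 import PPO, SAC

    # Humanoid-v4: MuJoCo humanoid with a 17-dimensional continuous action space
    env = gym.make("Humanoid-v4")

    # On-policy PPO with the learning rate and discount listed above
    ppo = PPO("MlpPolicy", env, learning_rate=3e-4, gamma=0.99, verbose=1)
    ppo.learn(total_timesteps=1_000_000)  # illustrative budget

    # Off-policy SAC with the same learning rate and discount
    sac = SAC("MlpPolicy", env, learning_rate=3e-4, gamma=0.99, verbose=1)
    sac.learn(total_timesteps=1_000_000)  # illustrative budget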

3 of 10

Gym Humanoid - PPO Results

Time Taken - 5 hrs

Max Reward - 382

    • Training loss fluctuated at the start, then decreased steadily
    • Variance increases over time
    • Reward stabilizes with time steps - likely converged to a sub-optimal policy

The data indicate that a plateau has been reached, beyond which the policy is not expected to improve further.

Gym Humanoid - SAC Results

Time Taken - 3 hrs

Max Reward - 3500

    • Training loss decreased steadily
    • Critic loss increases over time
    • Reward keeps increasing with time steps

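The reward-vs-time-step curves on these slides come from episode returns logged during training; a sketch of how such curves can be collected with Stable Baselines3 (file paths and the training budget are placeholders):

    import gymnasium as gym
    from stable_baselines3 import SAC
    from stable_baselines3.common.monitor import Monitor
    from stable_baselines3.common.evaluation import evaluate_policy

    # Monitor writes per-episode returns to a CSV, which can be plotted as reward vs. timesteps
    env = Monitor(gym.make("Humanoid-v4"), filename="sac_humanoid_monitor")

    model = SAC("MlpPolicy", env, learning_rate=3e-4, gamma=0.99,
                tensorboard_log="./sac_humanoid_tb/")  # curves are also visible in TensorBoard
    model.learn(total_timesteps=500_000)  # placeholder budget

    # Mean return of the trained policy over a few evaluation episodes
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")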

4 of 10

Humanoid - Stompy


Figures: Stompy URDF model - Stompy in MuJoCo - Stompy training (PPO, Gym, Stable Baselines)

Stompy - SAC

5 of 10

PPO Tuned

ent_coef = 0.0
learning_rate = 0.0003
n_epochs = 10
gae_lambda = 0.95
gamma = 0.99
batch_size = 64

Simulation OpenAI

    • Humanoid imported into the OpenAI Gymnasium environment
    • Joint limits defined for the lower and upper body
    • Both setups showed extremely high actor and critic losses

Humanoid Test

PPO:

ent_coef = 0.001
learning_rate = 1e-5
n_epochs = 2
gae_lambda = 0.9
gamma = 0.99
batch_size = 64
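A sketch of how the two configurations above map onto Stable Baselines3 PPO; the environment below is a stand-in, since the actual runs used the custom Stompy humanoid imported into Gymnasium:

    import gymnasium as gym
    from stable_baselines3 import PPO

    # "PPO Tuned" configuration
    tuned_kwargs = dict(ent_coef=0.0, learning_rate=3e-4, n_epochs=10,
                        gae_lambda=0.95, gamma=0.99, batch_size=64)

    # "Humanoid Test" configuration: lower learning rate, fewer epochs, small entropy bonus
    test_kwargs = dict(ent_coef=0.001, learning_rate=1e-5, n_epochs=2,
                       gae_lambda=0.9, gamma=0.99, batch_size=64)

    # Stand-in environment; the actual runs wrapped the converted Stompy model
    env = gym.make("Humanoid-v4")

    # Swap in test_kwargs to reproduce the "Humanoid Test" run instead
    model = PPO("MlpPolicy", env, **tuned_kwargs)
    model.learn(total_timesteps=2_000_000)  # placeholder budget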

Reward vs. Time Steps

6 of 10

2. BRAX + MJX

Brax - a physics engine for reinforcement learning and robotics research, designed for accelerator hardware and scalable to massively parallel simulation.

MuJoCo XLA - MJX - a JAX reimplementation of the MuJoCo physics engine.
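A minimal MJX sketch of moving a MuJoCo model onto the accelerator and stepping it with JAX; the XML path is a placeholder:

    import jax
    import mujoco
    from mujoco import mjx

    # Load a regular MuJoCo model on CPU, then copy it to the accelerator as an MJX model
    mj_model = mujoco.MjModel.from_xml_path("humanoid.xml")  # placeholder path
    mj_data = mujoco.MjData(mj_model)
    mjx_model = mjx.put_model(mj_model)
    mjx_data = mjx.put_data(mj_model, mj_data)

    # One JIT-compiled physics step (jax.vmap over the data can batch many rollouts in parallel)
    @jax.jit
    def step(data):
        return mjx.step(mjx_model, data)

    mjx_data = step(mjx_data)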

Humanoid robot trained to develop a gait and cover the maximum forward distance

Train Time : 10 min

Humanoid robot trained to get up after spawning on the ground

Train Time : 10 min

Reward vs. Time Steps

7 of 10

Humanoid - Stand Up

BRAX : Humanoid Gait Results

    • Reward per episode increases, reaching about 8500
    • Time steps - 0 to 35 million

PPO

num_timesteps = 30,000,000
num_evals = 5
episode_length = 1000
num_minibatches = 32
num_updates_per_batch = 8
learning_rate = 3e-4
entropy_cost = 1e-3
num_envs = 2048
batch_size = 1024

8 of 10

Simulation BRAX

PPO

num_timesteps = 30,000,000
num_evals = 5
episode_length = 1000
num_minibatches = 32
num_updates_per_batch = 8
learning_rate = 3e-4
entropy_cost = 1e-3
num_envs = 2048
batch_size = 1024
discount_factor = 0.97
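A sketch of the corresponding Brax training call; the environment name and seed are illustrative, and parameter names follow brax.training.agents.ppo:

    from brax import envs
    from brax.training.agents.ppo import train as ppo

    # Brax humanoid environment; the runs above used a simplified custom model
    env = envs.get_environment("humanoid")

    make_inference_fn, params, metrics = ppo.train(
        environment=env,
        num_timesteps=30_000_000,
        num_evals=5,
        episode_length=1000,
        num_minibatches=32,
        num_updates_per_batch=8,
        learning_rate=3e-4,
        entropy_cost=1e-3,
        num_envs=2048,      # parallel environments on the accelerator
        batch_size=1024,
        discounting=0.97,   # the discount factor listed above
        seed=0,             # illustrative
    )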

Parallel training across many environments significantly reduced training time.

Overall Training Time : approx. 25 min

Reducing model complexity by removing the arms and other nonessential parts increased the reward and reduced computation.

Reward vs. Time Steps

Complete model - issue : insufficient computational resources.

Hence, the optimized (simplified) model shown here is used.

9 of 10

URDF - XML

Computation Limits

Local Optimum

The URDF model is converted to XML (MJCF) for MuJoCo. Conditions such as:

    • Collision model constraints
    • Joint limits tuning
    • Observation space dimensions

These vary from model to model and should be updated before training.
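A sketch of the conversion step using MuJoCo's native URDF loader (file names are placeholders); collision geoms, joint limits, and actuators usually still need hand-editing in the saved XML:

    import mujoco

    # MuJoCo parses URDF directly; loading it compiles an MjModel in memory
    model = mujoco.MjModel.from_xml_path("stompy.urdf")  # placeholder path

    # Save the compiled model back out as native MJCF XML for further hand-tuning
    mujoco.mj_saveLastXML("stompy.xml", model)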

Hyperparameters such as batch_size, num_timesteps, num_envs, learning_rate, and episode_length determine the training time and how much computation is necessary, so it is important to tune these values to the compute available for the enhanced model.

When the hyperparameters are poorly chosen for the model, the agent tends to get stuck in a sub-optimal policy because the learning algorithm settles in a local optimum.

    • CAUSES : non-convex optimization; poor exploration/exploitation balance
    • SOLUTIONS : random policy initialization; reward shaping; SGD

ISSUES FACED

10 of 10

References :

    • MuJoCo MJX - https://mujoco.readthedocs.io/en/stable/mjx.html
    • OpenAI Gymnasium - https://gymnasium.farama.org/index.html
    • Stable Baselines3 A2C - https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html
    • Stable Baselines3 SAC - https://stable-baselines3.readthedocs.io/en/master/modules/sac.html
    • Stable Baselines3 PPO - https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
    • BRAX documentation - https://github.com/google/brax
    • BRAX paper - https://arxiv.org/abs/2106.13281
    • OpenAI Gym paper - https://arxiv.org/abs/1606.01540

Future Work :

    • The humanoid can be trained further in Gymnasium + MuJoCo, and controllers can be added.
    • The model can be 3D printed, and after further training the algorithm can be deployed on the hardware.
    • The aim is to reduce the sim-to-real gap.