Humanoid Robot
Group 19
Ruthwik Dasyam
Zahiruddin Mahammad
1. Gymnasium + MuJoCo
Gymnasium - Simple Pythonic interface for RL problems - a fork of OpenAI Gym
MuJoCo - A physics engine for model-based optimization - used as the environment visualizer
3D Bipedal Robot - Torso + head + 2 legs + 2 arms
Algorithm : PPO (Proximal Policy Optimization)
An on-policy algorithm; can be used for environments with either discrete or continuous action spaces.
Learning Rate : 0.0003
Gamma : 0.99
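As a sketch, this PPO setup maps onto Stable Baselines3 (assuming `gymnasium[mujoco]` and `stable-baselines3` are installed; the timestep budget and save path below are illustrative, not from the poster):

```python
# Learning rate and gamma from the slide above; everything else is SB3 defaults.
PPO_KWARGS = {"learning_rate": 0.0003, "gamma": 0.99}

def train_ppo(total_timesteps=1_000_000):
    # Imports are deferred so the config can be read without the RL stack installed.
    import gymnasium as gym
    from stable_baselines3 import PPO

    env = gym.make("Humanoid-v4")
    model = PPO("MlpPolicy", env, verbose=1, **PPO_KWARGS)
    model.learn(total_timesteps=total_timesteps)
    model.save("ppo_humanoid")  # hypothetical output path
    return model
```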
Gymnasium “Humanoid-v4”
Action space dimension - 17
Observation space dimension - 378
Body parts - 14, Joints - 23
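A minimal random-rollout sketch can confirm these space dimensions (assumes `gymnasium` with the MuJoCo extras; the step count is arbitrary):

```python
def inspect_humanoid(steps=10):
    # Deferred import: requires gymnasium[mujoco] installed.
    import gymnasium as gym

    env = gym.make("Humanoid-v4")
    print("action space:", env.action_space.shape)       # e.g. (17,)
    print("observation space:", env.observation_space.shape)
    obs, info = env.reset(seed=0)
    for _ in range(steps):
        action = env.action_space.sample()  # random policy, just to step the sim
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            obs, info = env.reset()
    env.close()
```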
Reward :
Algorithms Trained : SAC, PPO
SAC - Increasing Reward
PPO - Increasing Reward
Algorithm : SAC (Soft Actor-Critic)
An off-policy algorithm, can be used for environments with continuous action spaces.
Learning Rate : 0.0003
Gamma : 0.99
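The SAC run uses the same learning rate and discount; a corresponding Stable Baselines3 sketch (same assumptions and illustrative timestep budget as the PPO sketch):

```python
# Learning rate and gamma from the slide above; everything else is SB3 defaults.
SAC_KWARGS = {"learning_rate": 0.0003, "gamma": 0.99}

def train_sac(total_timesteps=1_000_000):
    # Deferred imports: requires gymnasium[mujoco] and stable-baselines3.
    import gymnasium as gym
    from stable_baselines3 import SAC

    env = gym.make("Humanoid-v4")
    model = SAC("MlpPolicy", env, verbose=1, **SAC_KWARGS)
    model.learn(total_timesteps=total_timesteps)
    return model
```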
Gym Humanoid - PPO Results
Time Taken - 5 hrs
Max Reward - 382
The reward reaches a threshold here, after which the policy is not expected to improve further.
Gym Humanoid - SAC Results
Time Taken - 3 hrs
Max Reward - 3500
The reward reaches a threshold here, after which the policy is not expected to improve further.
Humanoid - Stompy
Stompy URDF model in MuJoCo
Stompy Training - PPO - Gym - Stable Baselines
Stompy - SAC
PPO Tuned
ent_coef = 0.0
learning_rate = 0.0003
n_epochs = 10
gae_lambda = 0.95
gamma = 0.99
batch_size = 64
Simulation OpenAI
Humanoid Test
PPO:
ent_coef = 0.001
learning_rate = 1e-5
n_epochs = 2
gae_lambda = 0.9
gamma = 0.99
batch_size = 64
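Side by side, the two PPO configurations above differ in only a few entries; a small helper makes the diff explicit (the names `STOMPY_PPO` and `HUMANOID_TEST_PPO` are ours, taken from the slide titles):

```python
# Hyperparameters copied from the two PPO slides above.
STOMPY_PPO = {"ent_coef": 0.0, "learning_rate": 3e-4, "n_epochs": 10,
              "gae_lambda": 0.95, "gamma": 0.99, "batch_size": 64}
HUMANOID_TEST_PPO = {"ent_coef": 0.001, "learning_rate": 1e-5, "n_epochs": 2,
                     "gae_lambda": 0.9, "gamma": 0.99, "batch_size": 64}

def config_diff(a, b):
    """Return {key: (a_value, b_value)} for every key where the configs differ."""
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}

# ent_coef, learning_rate, n_epochs, and gae_lambda differ; gamma and batch_size match.
print(config_diff(STOMPY_PPO, HUMANOID_TEST_PPO))
```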
Reward vs Timesteps
2. BRAX + MJX
Brax - a physics engine for reinforcement learning and robotics research, designed for hardware acceleration and scalable to parallel simulation.
MuJoCo XLA (MJX) - a JAX reimplementation of the MuJoCo physics engine.
Humanoid robot trained to cover maximum distance (gait)
Train Time : 10 mins
Humanoid robot trained to get up after it spawns on the ground
Train Time : 10 mins
Reward vs Timesteps
Humanoid - Stand Up
BRAX : Humanoid Gait Results
PPO
num_timesteps = 30,000,000
num_evals = 5
episode_length = 1000
num_minibatches = 32
num_updates_per_batch = 8
learning_rate = 3e-4
entropy_cost = 1e-3
num_envs = 2048
batch_size = 1024
Simulation BRAX
PPO
num_timesteps = 30,000,000
num_evals = 5
episode_length = 1000
num_minibatches = 32
num_updates_per_batch = 8
learning_rate = 3e-4
entropy_cost = 1e-3
num_envs = 2048
batch_size = 1024
discount_factor = 0.97
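As a sketch, these settings map onto Brax's PPO trainer (assumes `brax` and `jax` are installed; `envs.get_environment` and `ppo.train` are the Brax v2 entry points, and Brax names the discount factor `discounting`):

```python
def train_humanoid_gait():
    # Deferred imports: requires brax (which pulls in jax) installed.
    from brax import envs
    from brax.training.agents.ppo import train as ppo

    env = envs.get_environment("humanoid", backend="mjx")
    make_inference_fn, params, metrics = ppo.train(
        environment=env,
        num_timesteps=30_000_000,
        num_evals=5,
        episode_length=1000,
        num_minibatches=32,
        num_updates_per_batch=8,
        learning_rate=3e-4,
        entropy_cost=1e-3,
        num_envs=2048,      # parallel environments simulated on the accelerator
        batch_size=1024,
        discounting=0.97,   # the slide's discount_factor
    )
    return make_inference_fn, params, metrics
```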
Parallel training across multiple environments significantly reduced training time.
Overall Training Time : Approx 25 mins
Reducing model complexity by eliminating the arms and other unnecessary parts increased the reward and reduced computation.
Reward vs Timesteps
Complete model - Issue : unavailability of computational resources
Hence, the optimized (simplified) model is used, as shown
URDF - XML
Computation Limits
Local Optimum
The URDF model is converted to XML for MuJoCo. There are conditions that vary from model to model and should be updated before training.
Hyperparameters such as batch_size, num_timesteps, num_envs, learning_rate, and episode_length determine the training time and how much computation is necessary, so it is important to tune these values based on the compute available.
When the hyperparameters are poorly chosen for the model, the agent is prone to getting stuck in a sub-optimal policy, as the learning algorithm converges to a local optimum.
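For intuition, in Stable Baselines3's PPO these hyperparameters translate directly into a gradient-update count (the formula assumes SB3's rollout scheme; `n_steps=2048` is SB3's default, not a value from the poster):

```python
def ppo_gradient_updates(total_timesteps, n_steps, n_envs, n_epochs, batch_size):
    """Number of minibatch gradient updates SB3's PPO performs overall."""
    rollouts = total_timesteps // (n_steps * n_envs)          # collection phases
    minibatches_per_epoch = (n_steps * n_envs) // batch_size  # updates per pass
    return rollouts * n_epochs * minibatches_per_epoch

# 1M timesteps, n_steps=2048, one env, n_epochs=10, batch_size=64:
print(ppo_gradient_updates(1_000_000, 2048, 1, 10, 64))  # 156160
```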
Exploration/Exploitation ratio
Reward Shaping, SGD
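As an illustration of reward shaping, a hypothetical shaped reward for the gait/stand-up tasks might combine an alive bonus, forward progress, and a control penalty (the weights and threshold below are made up, loosely following the structure of Gymnasium's Humanoid reward):

```python
def shaped_reward(torso_height, forward_velocity, control,
                  healthy_min=1.0, w_alive=5.0, w_forward=1.25, w_ctrl=0.1):
    """Hypothetical shaped reward: alive bonus + forward term - control cost."""
    alive = w_alive if torso_height > healthy_min else 0.0   # stay upright
    forward = w_forward * forward_velocity                   # encourage gait
    ctrl_cost = w_ctrl * sum(u * u for u in control)         # penalize large torques
    return alive + forward - ctrl_cost

print(shaped_reward(1.2, 1.0, [0.0, 0.5]))  # 6.225
```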
ISSUES FACED
References :
MuJoCo MJX - https://mujoco.readthedocs.io/en/stable/mjx.html
Gymnasium - https://gymnasium.farama.org/index.html
Stable Baselines3 A2C - https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html
Stable Baselines3 SAC - https://stable-baselines3.readthedocs.io/en/master/modules/sac.html
Stable Baselines3 PPO - https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
BRAX Documentation - https://github.com/google/brax
BRAX - https://arxiv.org/abs/2106.13281
Gym - https://arxiv.org/abs/1606.01540
Future Work :
The humanoid can be further trained in Gymnasium + MuJoCo, and controllers can be added.
The model can be 3D printed, and the algorithm, after further training, can be deployed on the hardware.
This aims to reduce the sim-to-real gap.