Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Presented by Rafael Hernandez and Andrei Mircea
General Presentation Outline:
Which algorithms preceded MuZero?
How can Monte Carlo Tree Search be used to plan with the MuZero neural networks? (a sketch of the networks involved follows this outline)
How does MuZero learn from its environment?
How is the model unrolled alongside the collected experience?
How do all the pieces fit together?
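To make the outline concrete, here is a minimal sketch, assuming a PyTorch-style implementation, of the three learned functions MuZero plans with: a representation function h, a dynamics function g, and a prediction function f. MuZeroNets, its method names, and the tiny linear layers are illustrative, not the paper's code.

# Illustrative sketch only: MuZero's three functions as tiny linear layers.
# Class and method names are hypothetical, not the official implementation.
import torch
import torch.nn as nn


class MuZeroNets(nn.Module):
    def __init__(self, obs_dim, latent_dim, num_actions):
        super().__init__()
        # h: observation -> latent state (representation function)
        self.representation = nn.Linear(obs_dim, latent_dim)
        # g: (latent state, one-hot action) -> (next latent state, reward) (dynamics function)
        self.dynamics = nn.Linear(latent_dim + num_actions, latent_dim + 1)
        # f: latent state -> (policy logits, value) (prediction function)
        self.prediction = nn.Linear(latent_dim, num_actions + 1)

    def initial_inference(self, obs):
        # Used once at the MCTS root: encode the real observation.
        s = self.representation(obs)
        out = self.prediction(s)
        return s, out[..., :-1], out[..., -1]          # latent state, policy logits, value

    def recurrent_inference(self, s, action_onehot):
        # Used at every simulated step: transition purely in latent space.
        out = self.dynamics(torch.cat([s, action_onehot], dim=-1))
        next_s, reward = out[..., :-1], out[..., -1]
        pred = self.prediction(next_s)
        return next_s, reward, pred[..., :-1], pred[..., -1]

During planning, MCTS calls initial_inference once at the root and recurrent_inference for every simulated action, so the search runs entirely in the learned latent space and never queries the real environment.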
Method
Updating the Learned Model
Aligning training with episode generation
Policy loss
Reward loss
Value loss
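A rough sketch of how these three losses might be combined when the model is unrolled K steps from a sampled position, reusing the illustrative nets interface from the earlier sketch; muzero_loss and the batch layout are assumptions, not the paper's code.

# Illustrative sketch of the unrolled training objective: at each of K
# unroll steps, the policy prediction is matched to MCTS visit counts,
# the value prediction to an n-step return, and the reward prediction to
# the observed reward. Function and argument names are hypothetical.
import torch
import torch.nn.functional as F


def cross_entropy_with_probs(logits, target_probs):
    # Policy targets are distributions (MCTS visit counts), not class indices.
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(-1).mean()


def muzero_loss(nets, obs, actions, target_policies, target_values, target_rewards, K=5):
    s, policy_logits, value = nets.initial_inference(obs)
    loss = cross_entropy_with_probs(policy_logits, target_policies[0])   # policy loss
    loss = loss + F.mse_loss(value, target_values[0])                    # value loss
    for k in range(1, K + 1):
        s, reward, policy_logits, value = nets.recurrent_inference(s, actions[k - 1])
        loss = loss + cross_entropy_with_probs(policy_logits, target_policies[k])  # policy loss
        loss = loss + F.mse_loss(value, target_values[k])                          # value loss
        loss = loss + F.mse_loss(reward, target_rewards[k])                        # reward loss
    # The paper additionally applies L2 weight regularization and scales
    # gradients through the unrolled dynamics; both are omitted here for brevity.
    return loss

Unrolling the same model that is used for acting is what aligns training with episode generation: the targets at each step come from the stored trajectory that the search itself produced.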
Method
Reanalyze
Reanalyze: continued learning from past trajectories
Normal training
Model learns on-policy (i.e. with trajectories generated by its policy)
Diagram: Actor (T), Policy and value (T), Training (T)
Reanalyze
Model learns off-policy (i.e. with trajectories generated by past checkpoints)
Diagram: Actor (T), Policy and value (T+t), Training (T+t)
Reanalyze: storing past trajectories
Reanalyze: learning from current and past trajectories
Reanalyze: learn more from less data
Reanalyze: bootstrapping with internal feedback loops
The model uses its own predictions to create a useful learning signal
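A minimal sketch of the reanalyze idea; the Step container, run_mcts_fn, value_fn, and the default hyperparameters are all illustrative stand-ins, not the paper's code. Stored trajectories keep their observations, actions and rewards, but their policy and value targets are periodically recomputed with the latest network, so the same data yields a fresher learning signal.

# Illustrative sketch of reanalyze: refresh the training targets of an old
# trajectory using the latest network. run_mcts_fn stands in for a fresh
# MCTS search and value_fn for the current value head; defaults are arbitrary.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    observation: object
    reward: float
    target_policy: Optional[list] = None
    target_value: float = 0.0


def reanalyze(trajectory: List[Step],
              run_mcts_fn: Callable,   # observation -> visit-count policy from a fresh search
              value_fn: Callable,      # observation -> value estimate of the latest network
              discount: float = 0.997,
              n: int = 10) -> List[Step]:
    for t, step in enumerate(trajectory):
        # New policy target: re-run MCTS from the stored observation with the latest model.
        step.target_policy = run_mcts_fn(step.observation)
        # New value target: discounted rewards along the stored trajectory,
        # bootstrapped with the latest value function instead of the stale one.
        bootstrap = t + n
        if bootstrap < len(trajectory):
            value = (discount ** n) * value_fn(trajectory[bootstrap].observation)
        else:
            value = 0.0
        for k in range(t, min(bootstrap, len(trajectory))):
            value += (discount ** (k - t)) * trajectory[k].reward
        step.target_value = value
    return trajectory   # pushed back to the replay buffer with refreshed targets

This off-policy refresh is what lets the model learn more from less data: old trajectories keep producing useful targets as the network improves.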
Results
The Scaling Perspective
Scaling MCTS at inference (Go)
Even when trained with a fixed MCTS budget, increasing the search budget at inference still improves performance
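A hedged sketch of what this scaling experiment amounts to operationally; evaluate_search_scaling, play_game_fn, and the budget values are hypothetical: the trained weights are frozen and only the number of MCTS simulations per move is varied at evaluation time.

# Illustrative sketch of inference-time search scaling: the same trained
# network is evaluated under increasing MCTS budgets. play_game_fn is a
# hypothetical callable (num_simulations -> final score of one game).
def evaluate_search_scaling(play_game_fn,
                            budgets=(25, 50, 100, 200, 400, 800),   # arbitrary example budgets
                            games_per_budget=100):
    results = {}
    for num_simulations in budgets:
        # Same trained network every time; only the amount of planning changes.
        scores = [play_game_fn(num_simulations) for _ in range(games_per_budget)]
        results[num_simulations] = sum(scores) / len(scores)
    return results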
Scaling MCTS at training (Ms. Pacman)
Increasing the MCTS budget improves performance, with diminishing returns (surprisingly, performance improves even with 6-7 simulations, which is fewer than the number of actions)
Scaling MCTS at training (all 57 Atari games)
The effect of scaling MCTS does not seem universal or consistent across games
Unsolved environments
Phase Transitions
Questions