Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Presented by Rafael Hernandez and Andrei Mircea
General Presentation Outline:
Which algorithms preceded MuZero?
How can Monte Carlo Tree Search be used to plan with the MuZero neural networks? (a sketch of the networks involved follows this outline)
How does MuZero learn from its environment?
How is the model unrolled alongside the collected experience?
How do all the pieces fit together?
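To make the outline concrete, here is a minimal sketch, assuming a PyTorch-style implementation, of the three learned functions MuZero plans with: a representation function h, a dynamics function g, and a prediction function f. MuZeroNets, its method names, and the tiny linear layers are illustrative, not the paper's code.

# Illustrative sketch only: MuZero's three functions as tiny linear layers.
# Class and method names are hypothetical, not the official implementation.
import torch
import torch.nn as nn


class MuZeroNets(nn.Module):
    def __init__(self, obs_dim, latent_dim, num_actions):
        super().__init__()
        # h: observation -> latent state (representation function)
        self.representation = nn.Linear(obs_dim, latent_dim)
        # g: (latent state, one-hot action) -> (next latent state, reward) (dynamics function)
        self.dynamics = nn.Linear(latent_dim + num_actions, latent_dim + 1)
        # f: latent state -> (policy logits, value) (prediction function)
        self.prediction = nn.Linear(latent_dim, num_actions + 1)

    def initial_inference(self, obs):
        # Used once at the MCTS root: encode the real observation.
        s = self.representation(obs)
        out = self.prediction(s)
        return s, out[..., :-1], out[..., -1]          # latent state, policy logits, value

    def recurrent_inference(self, s, action_onehot):
        # Used at every simulated step: transition purely in latent space.
        out = self.dynamics(torch.cat([s, action_onehot], dim=-1))
        next_s, reward = out[..., :-1], out[..., -1]
        pred = self.prediction(next_s)
        return next_s, reward, pred[..., :-1], pred[..., -1]

During planning, MCTS calls initial_inference once at the root and recurrent_inference for every simulated action, so the search runs entirely in the learned latent space and never queries the real environment.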
Method
Updating the Learned Model
Aligning training with episode generation
Policy loss
Reward loss
Value loss
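A rough sketch of how these three losses might be combined when the model is unrolled K steps from a sampled position, reusing the illustrative nets interface from the earlier sketch; muzero_loss and the batch layout are assumptions, not the paper's code.

# Illustrative sketch of the unrolled training objective: at each of K
# unroll steps, the policy prediction is matched to MCTS visit counts,
# the value prediction to an n-step return, and the reward prediction to
# the observed reward. Function and argument names are hypothetical.
import torch
import torch.nn.functional as F


def cross_entropy_with_probs(logits, target_probs):
    # Policy targets are distributions (MCTS visit counts), not class indices.
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(-1).mean()


def muzero_loss(nets, obs, actions, target_policies, target_values, target_rewards, K=5):
    s, policy_logits, value = nets.initial_inference(obs)
    loss = cross_entropy_with_probs(policy_logits, target_policies[0])   # policy loss
    loss = loss + F.mse_loss(value, target_values[0])                    # value loss
    for k in range(1, K + 1):
        s, reward, policy_logits, value = nets.recurrent_inference(s, actions[k - 1])
        loss = loss + cross_entropy_with_probs(policy_logits, target_policies[k])  # policy loss
        loss = loss + F.mse_loss(value, target_values[k])                          # value loss
        loss = loss + F.mse_loss(reward, target_rewards[k])                        # reward loss
    # The paper additionally applies L2 weight regularization and scales
    # gradients through the unrolled dynamics; both are omitted here for brevity.
    return loss

Unrolling the same model that is used for acting is what aligns training with episode generation: the targets at each step come from the stored trajectory that the search itself produced.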
Method
Reanalyze
Reanalyze: continued learning from past trajectories
Normal training
Model learns on-policy (i.e. with trajectories generated by its policy)
Diagram: Actor (T), Policy and value (T), Training (T)
Reanalyze
Model learns off-policy (i.e. with trajectories generated by past checkpoints)
Diagram: Actor (T), Policy and value (T+t), Training (T+t)
Reanalyze: storing past trajectories
Reanalyze: learning from current and past trajectories
Reanalyze: learn more from less data
Reanalyze: bootstrapping with internal feedback loops
The model uses its own predictions to create a useful learning signal
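A minimal sketch of the reanalyze idea; the Step container, run_mcts_fn, value_fn, and the default hyperparameters are all illustrative stand-ins, not the paper's code. Stored trajectories keep their observations, actions and rewards, but their policy and value targets are periodically recomputed with the latest network, so the same data yields a fresher learning signal.

# Illustrative sketch of reanalyze: refresh the training targets of an old
# trajectory using the latest network. run_mcts_fn stands in for a fresh
# MCTS search and value_fn for the current value head; defaults are arbitrary.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    observation: object
    reward: float
    target_policy: Optional[list] = None
    target_value: float = 0.0


def reanalyze(trajectory: List[Step],
              run_mcts_fn: Callable,   # observation -> visit-count policy from a fresh search
              value_fn: Callable,      # observation -> value estimate of the latest network
              discount: float = 0.997,
              n: int = 10) -> List[Step]:
    for t, step in enumerate(trajectory):
        # New policy target: re-run MCTS from the stored observation with the latest model.
        step.target_policy = run_mcts_fn(step.observation)
        # New value target: discounted rewards along the stored trajectory,
        # bootstrapped with the latest value function instead of the stale one.
        bootstrap = t + n
        if bootstrap < len(trajectory):
            value = (discount ** n) * value_fn(trajectory[bootstrap].observation)
        else:
            value = 0.0
        for k in range(t, min(bootstrap, len(trajectory))):
            value += (discount ** (k - t)) * trajectory[k].reward
        step.target_value = value
    return trajectory   # pushed back to the replay buffer with refreshed targets

This off-policy refresh is what lets the model learn more from less data: old trajectories keep producing useful targets as the network improves.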
Results
The Scaling Perspective
Scaling MCTS at inference (Go)
Even when trained with a fixed MCTS budget, increasing the search budget at inference still improves performance
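A hedged sketch of what this scaling experiment amounts to operationally; evaluate_search_scaling, play_game_fn, and the budget values are hypothetical: the trained weights are frozen and only the number of MCTS simulations per move is varied at evaluation time.

# Illustrative sketch of inference-time search scaling: the same trained
# network is evaluated under increasing MCTS budgets. play_game_fn is a
# hypothetical callable (num_simulations -> final score of one game).
def evaluate_search_scaling(play_game_fn,
                            budgets=(25, 50, 100, 200, 400, 800),   # arbitrary example budgets
                            games_per_budget=100):
    results = {}
    for num_simulations in budgets:
        # Same trained network every time; only the amount of planning changes.
        scores = [play_game_fn(num_simulations) for _ in range(games_per_budget)]
        results[num_simulations] = sum(scores) / len(scores)
    return results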
Scaling MCTS at training (Ms. Pacman)
Increasing the MCTS budget improves performance, with diminishing returns (surprisingly, performance improves even with 6-7 simulations, which is fewer than the number of actions)
Scaling MCTS at training (all 57 Atari games)
The effect of scaling MCTS does not seem universal or consistent across games
Unsolved environments
Phase Transitions
Questions