Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Presented by Rafael Hernandez and Andrei Mircea

General Presentation Outline:

  1. Previous Versions and Introduction of the Algorithm
  2. Main RL Components of MuZero
  3. Results from the Scaling Perspective

What are the previous versions of MuZero?

How can Monte Carlo Tree Search be used to plan with the MuZero neural networks?
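
To make the planning step concrete, here is a minimal sketch (not the authors' implementation) of how the three learned networks can drive a simplified MCTS: the representation network h encodes the current observation into a root hidden state, the dynamics network g expands nodes by imagining transitions and rewards, and the prediction network f supplies action priors and values. All names and the simplified pUCT rule (no value normalization, no Dirichlet exploration noise) are illustrative assumptions.

    import math
    import numpy as np

    # Minimal sketch of MuZero-style planning (illustrative; not the authors' code).
    # h: representation network  observation -> s^0
    # g: dynamics network        (state, action) -> (reward, next_state)
    # f: prediction network      state -> (policy_logits, value)

    class Node:
        def __init__(self, prior):
            self.prior = prior            # P(s, a) from the prediction network
            self.visit_count = 0
            self.value_sum = 0.0
            self.reward = 0.0             # reward predicted by the dynamics network
            self.state = None             # hidden state (None until expanded)
            self.children = {}            # action -> Node

        def value(self):
            return self.value_sum / self.visit_count if self.visit_count else 0.0

    def ucb_score(parent, child, c1=1.25):
        # Simplified pUCT: mean value plus a prior-weighted exploration bonus.
        explore = c1 * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
        return child.value() + explore

    def expand(node, state, f, num_actions):
        node.state = state
        logits, value = f(state)
        priors = np.exp(logits - np.max(logits))
        priors /= priors.sum()
        node.children = {a: Node(prior=float(priors[a])) for a in range(num_actions)}
        return value

    def run_mcts(observation, h, g, f, num_actions, num_simulations=50, discount=0.997):
        root = Node(prior=1.0)
        expand(root, h(observation), f, num_actions)
        for _ in range(num_simulations):
            node, path = root, [root]
            # Selection: descend with pUCT until reaching an unexpanded node.
            while node.children:
                action, node = max(node.children.items(),
                                   key=lambda item: ucb_score(path[-1], item[1]))
                path.append(node)
            # Expansion: one imagined step of dynamics from the parent's hidden state.
            parent = path[-2]
            node.reward, next_state = g(parent.state, action)
            value = expand(node, next_state, f, num_actions)
            # Backup: propagate the bootstrapped value back towards the root.
            for n in reversed(path):
                n.value_sum += value
                n.visit_count += 1
                value = n.reward + discount * value
        visits = np.array([root.children[a].visit_count for a in range(num_actions)], float)
        return visits / visits.sum(), root.value()   # search policy and root value

    # Toy usage with random stand-in "networks", purely to show the call pattern.
    rng = np.random.default_rng(0)
    h = lambda obs: rng.normal(size=8)
    g = lambda state, action: (0.0, state + 0.1 * action)
    f = lambda state: (rng.normal(size=4), float(np.tanh(state.mean())))
    search_policy, root_value = run_mcts(np.zeros(8), h, g, f, num_actions=4)

Note that the search never touches the real environment: every node below the root lives entirely in the model's hidden state space, which is what lets MuZero plan without being given the true rules.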

How does MuZero learn from its environment?
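
At acting time, a rough sketch of the interaction loop looks like the following: at every environment step the agent plans with MCTS, samples an action from the visit-count distribution, and records the observation, action, reward, search policy, and search (root) value so the episode can later be replayed for training. The env.reset()/env.step() interface and all names here are simplified assumptions; search stands for any planner with the signature of run_mcts above.

    import numpy as np

    # Sketch of the acting / self-play loop (illustrative assumptions throughout).
    # `env` is assumed to expose reset() -> obs and step(action) -> (obs, reward, done).

    def play_episode(env, search, h, g, f, num_actions, temperature=1.0, rng=None):
        rng = rng or np.random.default_rng()
        trajectory = []      # (observation, action, reward, search_policy, search_value)
        obs, done = env.reset(), False
        while not done:
            search_policy, search_value = search(obs, h, g, f, num_actions)
            # Act by sampling from the visit-count distribution; the temperature
            # trades exploration against exploitation.
            probs = search_policy ** (1.0 / temperature)
            probs /= probs.sum()
            action = int(rng.choice(num_actions, p=probs))
            next_obs, reward, done = env.step(action)
            trajectory.append((obs, action, reward, search_policy, search_value))
            obs = next_obs
        return trajectory    # one episode of experience for the replay buffer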

How is the model unrolled alongside the collected experience?
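
At training time the model is unrolled for K steps alongside a stored subsequence of experience: the representation network encodes the real observation o_t, and the dynamics network is then stepped forward using the actions that were actually executed, so each imagined step lines up with a real time step whose targets it must match. A minimal sketch with illustrative names:

    # Sketch of the K-step unroll used at training time (names are illustrative).

    def unroll(h, g, f, observation, actions):
        """Return per-step (predicted_reward, policy_logits, value) for k = 0..K."""
        predictions = []
        state = h(observation)                     # s^0 from the representation network
        logits, value = f(state)
        predictions.append((0.0, logits, value))   # no reward is predicted at k = 0
        for action in actions:                     # the K actions actually taken after o_t
            reward, state = g(state, action)       # one imagined dynamics step
            logits, value = f(state)
            predictions.append((reward, logits, value))
        return predictions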

How do all the pieces fit together?
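
A very rough orchestration sketch of how acting and learning could be combined (the helpers sample_batch, compute_losses, and apply_gradients are hypothetical stand-ins for the pieces detailed on the following slides; this is not the paper's training setup):

    # High-level loop tying the pieces together (illustrative; sample_batch,
    # compute_losses and apply_gradients are hypothetical stand-in helpers).

    def train(env, networks, replay_buffer, num_iterations, unroll_steps=5):
        h, g, f = networks
        for _ in range(num_iterations):
            # 1. Acting: generate a fresh episode with MCTS-guided self-play.
            replay_buffer.append(play_episode(env, run_mcts, h, g, f, env.num_actions))
            # 2. Learning: sample stored subsequences, unroll the model over them,
            #    and update the networks to match targets built from real experience.
            for observation, actions, targets in sample_batch(replay_buffer, unroll_steps):
                predictions = unroll(h, g, f, observation, actions)
                loss = compute_losses(predictions, targets)  # policy + value + reward terms
                apply_gradients(networks, loss)              # gradient step on all three nets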

Method

Updating the Learned Model

Aligning training with episode generation
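
Concretely, each sampled position t in a stored episode is paired with targets for every unroll step k: the MCTS visit distribution recorded when acting (policy target), the observed reward (reward target), and an n-step return bootstrapped with the stored search value (value target). A simplified sketch, reusing the trajectory layout from play_episode above and glossing over the paper's exact indexing and episode-boundary handling:

    # Sketch of target construction for one sampled position t (simplified indexing;
    # `trajectory` holds (observation, action, reward, search_policy, search_value)).

    def make_targets(trajectory, t, unroll_steps=5, n_steps=10, discount=0.997):
        targets = []
        for k in range(unroll_steps + 1):
            step = t + k
            if step >= len(trajectory):
                break                              # episode ended inside the unroll
            # Value target: n-step return bootstrapped with the stored search value.
            z, bootstrap = 0.0, step + n_steps
            for i in range(step, min(bootstrap, len(trajectory))):
                z += (discount ** (i - step)) * trajectory[i][2]       # observed rewards
            if bootstrap < len(trajectory):
                z += (discount ** n_steps) * trajectory[bootstrap][4]  # stored root value
            policy_target = trajectory[step][3]    # MCTS visit-count distribution
            reward_target = trajectory[step][2]    # observed reward at this step
            targets.append((policy_target, z, reward_target))
        return targets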

Policy loss
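
In the paper, the per-step policy loss is a cross-entropy between the MCTS visit-count distribution \pi_{t+k} (the improved policy produced by search) and the network's predicted policy p^k_t:

    l^p(\pi_{t+k}, p^k_t) = -\sum_a \pi_{t+k}(a) \, \log p^k_t(a)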

Reward loss
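
For Atari, the observed reward u_{t+k} is mapped to a categorical (support) representation, written \phi below, and the reward loss is a cross-entropy against the predicted reward distribution r^k_t; board games have no intermediate rewards, so this term is omitted there:

    l^r(u_{t+k}, r^k_t) = -\sum_i \phi(u_{t+k})_i \, \log r^k_t(i)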

Value loss
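
The value target is an n-step return bootstrapped with the search value \nu produced by MCTS (for board games, z is simply the final game outcome), and the value loss l^v compares this target with the predicted value v^k_t:

    z_{t+k} = u_{t+k+1} + \gamma u_{t+k+2} + \dots + \gamma^{n-1} u_{t+k+n} + \gamma^{n} \nu_{t+k+n}

Summing the three per-step terms over the K unrolled steps, together with L2 regularization, gives the overall training loss from the paper:

    l_t(\theta) = \sum_{k=0}^{K} \Big[ l^p(\pi_{t+k}, p^k_t) + l^v(z_{t+k}, v^k_t) + l^r(u_{t+k}, r^k_t) \Big] + c \, \lVert \theta \rVert^2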

Method

Reanalyze

Reanalyze: continued learning from past trajectories

Normal training

Model learns on-policy (i.e. with trajectories generated by its policy)

[Diagram: Actor (T) → policy and value (T) → Training (T)]

Reanalyze

Model learns off-policy (i.e. with trajectories generated by past checkpoints)

[Diagram: Actor (T) → policy and value (T+t) → Training (T+t)]

Reanalyze: storing past trajectories

Reanalyze: learning from current and past trajectories

Reanalyze: learn more from less data

  • Increase replay 10x
  • Learn from 10% new trajectories and 90% reanalyzed trajectories (see the sketch below)
  • Result: trains on 100x fewer environment frames
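
A minimal sketch of how that data mix could be implemented (the fractions come from the slide; generate_fresh and reanalyze are illustrative stand-ins, e.g. play_episode with the latest network and a re-run of MCTS over a stored episode's observations):

    import numpy as np

    # Illustrative sketch of the Reanalyze data mix (not the authors' code):
    # 10% of training trajectories are freshly generated, 90% are old trajectories
    # whose policy and value targets are recomputed with the current network.

    def sample_training_trajectory(replay_buffer, generate_fresh, reanalyze,
                                   fresh_fraction=0.1, rng=None):
        rng = rng or np.random.default_rng()
        if rng.random() < fresh_fraction or not replay_buffer:
            trajectory = generate_fresh()        # act with the latest network
            replay_buffer.append(trajectory)
            return trajectory
        old = replay_buffer[rng.integers(len(replay_buffer))]
        # Re-run search with the *current* parameters over the stored observations
        # to obtain fresher policy and value targets for the same actions and rewards.
        return reanalyze(old)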

Reanalyze: bootstrapping with internal feedback loops

The model uses its own predictions to create a useful learning signal

Results

The Scaling Perspective

Scaling MCTS at inference (Go)

Even when trained with a fixed MCTS budget, increasing the search budget at inference still improves playing strength

Scaling MCTS at training (Ms. Pacman)

Increasing the MCTS budget improves performance with diminishing returns (surprisingly, even with 6-7 simulations, which is fewer than the number of actions)

Scaling MCTS at training (all 57 Atari games)

The effect of scaling MCTS does not seem universal or consistent across games

Unsolved environments

Phase Transitions

Questions
