1 of 22

Lecture 13��Planning & Learning with Tabular Methods: Part 2-Additional Topics

1

Institute for Data Science & Artificial Intelligence

Lecturer: Ercan Atam

Course: DSAI 542-Reinforcement Learning

2 of 22

2

List of contents

Prioritized sweeping

Decision-time planning

Heuristic search

Rollout algoritms

Monte Carlo Tree Search

3 of 22

3

Relevant readings for this lecture

Chapter 8 of “Reinforcement Learning: An Introduction”, Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press, Cambridge, MA, 2018

4 of 22

4

In the Dyna agents presented before, simulated transitions are started in state–action pairs selected uniformly at random from all previously experienced pairs.

A uniform selection is usually not the best; planning can be much more efficient if simulated transitions and updates are focused on particular state–action pairs.

For example, consider the first maze example given before. At the beginning of the second episode, only the state–action pair leading directly into the goal has a positive value; the values of all other pairs are still zero.

This means that it is pointless to perform updates along almost all transitions, because they take the agent from one zero-valued state to another, and thus the updates would have no effect. Only an update

along a transition into the state just prior to the goal, or from it, will change any values.

Prioritized sweeping (1)

5 of 22

5

Prioritized sweeping (2)

Q: Which state-action pairs should be updated during planning?

6 of 22

6

Prioritized sweeping algorithm

7 of 22

7

Prioritized sweeping on mazes

Prioritized sweeping has been found to dramatically increase the speed at which optimal solutions are found in maze tasks, often by a factor of 5 to 10.

A typical example is shown on the left.

These data are for a sequence of maze tasks of exactly the same structure as before, except that they vary in the grid resolution.

Prioritized sweeping maintained a decisive advantage

over unprioritized Dyna-Q. Both systems made at most

n = 5 updates per environmental interaction.

8 of 22

8

Decision-time planning

Planning can be used in at least two ways: background planning and decision-time planning.

9 of 22

9

Heuristic search

It is a decision-time planning algorithm.

In heuristic search, tree-like continuations are constructed for each state encountered.

Value function of leaf nodes are approximated (using a model) and then backed up toward the current state at root.

The backing up stops at the state–action nodes for the current state.

Once the backed-up values of the state–action nodes for the current state are computed, the best of them is

chosen as the current action, and then all backed-up values are discarded.

Fig.: Heuristic search can be implemented as a sequence of one-step updates (shown here outlined in blue) backing up values from the leaf nodes toward the root.

10 of 22

10

Rollout algorithms (1)

They are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories

(using a model) that all begin at the current state.

They estimate action values for a given policy (called the “rollout policy”) by averaging the returns of many

simulated trajectories that start with each possible action and then follow the given rollout policy.

When the action-value estimates are considered to be accurate enough, the action having the highest

estimated value is executed, after which the process is carried out anew from the resulting next state.

Rollout policy

11 of 22

11

Rollout algorithms (2)

12 of 22

12

Monte Carlo Tree Search (1)

Monte Carlo Tree Search (MCTS) is a recent and strikingly successful example of decision-time, rollout

algorithm, which additionally:

accumulates value estimates from former MC simulations.
constructs a tree and uses a tree policy for traversing the tree.

MCTS is largely responsible for the improvement in the computer game “Go” from a weak amateur level in 2005 to a grandmaster level in 2015.

Directs simulations toward highly-rewarding trajectories.

Actions in the simulated trajectories are generated using a simple policy, usually called a rollout policy.

Involves multiple simulations from current state to terminal state.

Executed at each new state to select agent’s action just for that state.

13 of 22

13

Monte Carlo Tree Search (2)

Fig: Basic building blocks of MCTS algorithm.

14 of 22

14

Monte Carlo Tree Search (3)

1. Selection: Starting at root node, use a tree policy (e.g., ε-greedy or UCB rule) to travel through the tree until arriving at a leaf node.

UCB selection rule

15 of 22

15

Monte Carlo Tree Search (4)

2. Expansion: On some iterations (depending on the details of the application), the tree is expanded from the selected leaf node by adding one or more child nodes (which are reachable from the selected leaf node via unexplored actions).

16 of 22

16

Monte Carlo Tree Search (5)

1: The rollout policy could be random, pre-trained or based on model-free methods using real experience (if available), source: https://groups.uni-paderborn.de/lea/share/lehre/reinforcementlearning/lecture_slides/built/Lecture07.pdf

17 of 22

17

Monte Carlo Tree Search (6)

4. Backup: The return generated by the simulated episode is backed up to update the values (state values

and/or action values) along the traveled trajectory, but only saves those corresponding to the tree policy.

18 of 22

18

Monte Carlo Tree Search (7)

After the time is up, MCTS picks at the root node the action according to some mechanism, for example:

choosing the action with the highest action value
choosing the action with the largest visit count

After transitioning to a new state, MCTS procedure re-starts:

either constructing a new tree with the root node corresponding to the new state

or, by re-utilizing some applicable parts from the previous tree

MCTS-based algorithms are not limited to game applications but were able to achieve outstanding success

in this field.

Remarks:

(Source: https://groups.uni-paderborn.de/lea/share/lehre/reinforcementlearning/lecture_slides/built/Lecture07.pdf )

19 of 22

19

A Monte Carlo Tree Search algorithm

Start

Is current a leaf node?

No

Yes

Rollout

Yes

For each available action from current, add a new state to the tree

No

Current = first new child node

Rollout

A MCTS algorithm for selection

and expansion parts

(Source, https://www.youtube.com/watch?v=UXW2yZndl7U&t=14s)

20 of 22

20

Some good Monte Carlo Tree Search tutorials on the web

https://www.youtube.com/watch?v=UXW2yZndl7U&t=0s

Below are two good Youtube videos on MCTS.

https://www.youtube.com/watch?v=Fbs4lnGLS8M

21 of 22

21

Two most important dimensions explored in lectures 1-13

Fig: A slice through the space of reinforcement learning methods, highlighting the two of the most important dimensions explored in Lectures 1-13: the depth and width of the updates.

22 of 22

References �(utilized for preparation of lecture notes or Matlab code)

22