Lecture 13��Planning & Learning with Tabular Methods: Part 2-Additional Topics
1
Institute for Data Science & Artificial Intelligence
Lecturer: Ercan Atam
Course: DSAI 542-Reinforcement Learning
2
List of contents
3
Relevant readings for this lecture
4
along a transition into the state just prior to the goal, or from it, will change any values.
Prioritized sweeping (1)
5
Prioritized sweeping (2)
Q: Which state-action pairs should be updated during planning?
6
Prioritized sweeping algorithm
7
Prioritized sweeping on mazes
over unprioritized Dyna-Q. Both systems made at most
n = 5 updates per environmental interaction.
8
Decision-time planning
Planning can be used in at least two ways: background planning and decision-time planning.
9
Heuristic search
chosen as the current action, and then all backed-up values are discarded.
Fig.: Heuristic search can be implemented as a sequence of one-step updates (shown here outlined in blue) backing up values from the leaf nodes toward the root.
10
Rollout algorithms (1)
(using a model) that all begin at the current state.
simulated trajectories that start with each possible action and then follow the given rollout policy.
estimated value is executed, after which the process is carried out anew from the resulting next state.
Rollout policy
11
Rollout algorithms (2)
12
Monte Carlo Tree Search (1)
algorithm, which additionally:
13
Monte Carlo Tree Search (2)
Fig: Basic building blocks of MCTS algorithm.
14
Monte Carlo Tree Search (3)
1. Selection: Starting at root node, use a tree policy (e.g., ε-greedy or UCB rule) to travel through the tree until arriving at a leaf node.
UCB selection rule
15
Monte Carlo Tree Search (4)
2. Expansion: On some iterations (depending on the details of the application), the tree is expanded from the selected leaf node by adding one or more child nodes (which are reachable from the selected leaf node via unexplored actions).
16
Monte Carlo Tree Search (5)
1: The rollout policy could be random, pre-trained or based on model-free methods using real experience (if available), source: https://groups.uni-paderborn.de/lea/share/lehre/reinforcementlearning/lecture_slides/built/Lecture07.pdf
17
Monte Carlo Tree Search (6)
4. Backup: The return generated by the simulated episode is backed up to update the values (state values
and/or action values) along the traveled trajectory, but only saves those corresponding to the tree policy.
18
Monte Carlo Tree Search (7)
in this field.
Remarks:
19
A Monte Carlo Tree Search algorithm
Start
Is current a leaf node?
No
Yes
Rollout
Yes
For each available action from current, add a new state to the tree
No
Current = first new child node
Rollout
A MCTS algorithm for selection
and expansion parts
20
Some good Monte Carlo Tree Search tutorials on the web
Below are two good Youtube videos on MCTS.
21
Two most important dimensions explored in lectures 1-13
Fig: A slice through the space of reinforcement learning methods, highlighting the two of the most important dimensions explored in Lectures 1-13: the depth and width of the updates.
References �(utilized for preparation of lecture notes or Matlab code)
22