Lecture 7: Monte Carlo Methods, Part 1 - Introduction to MC Methods, MC Prediction and Control
Instructor: Ercan Atam
Institute for Data Science & Artificial Intelligence
Course: DSAI 542-Reinforcement Learning
Contents of this lecture
Relevant readings for this lecture
Monte Carlo methods (1)
Monte Carlo (MC) methods are defined here only for episodic tasks: we assume that experience is divided into episodes and that all episodes eventually terminate no matter what actions are selected.
Only on the completion of an episode are value estimates and policies changed.
MC methods are therefore incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.
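As a reminder of the quantities involved (standard definitions, stated here for reference since the slide's formulas are not reproduced above): the return from time step t in an episode terminating at time T, and the MC estimate of a state value as an average of sampled returns, are

\[ G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T, \qquad v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s), \]

where G^{(i)}(s) denotes the i-th sampled return observed from state s and N(s) is the number of such returns.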
Monte Carlo methods (2)
Monte Carlo methods (3)
Monte Carlo methods vs dynamic programming
(Source: derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Dynamic programming: requires a complete model of the environment's dynamics; updates value estimates by bootstrapping from the estimates of successor states (shallow, one-step, full-width backups).
Monte Carlo methods: require only experience, i.e., sample episodes of states, actions, and rewards; estimate values by averaging complete sampled returns, without bootstrapping (deep backups over entire episodes).
Monte Carlo prediction (1)
MC prediction problem
…
Monte Carlo prediction (2)
Monte Carlo prediction (3)
First-visit and every-visit MC prediction methods
Definition: first-visit MC prediction method
In the first-visit MC prediction method, when state “s” is visited for the first time in an episode, you start summing the (discounted) rewards from that point until the end of the episode.
If state “s” reappears later in the same episode, the reappearance is ignored and no new sum is started. For state “s”, there is at most one sum (return) per episode.
The value of state “s” is then taken as the average of these sums over all episodes in which “s” appears.
Definition: every-visit MC prediction method
In the every-visit MC prediction method, a new sum is started at every occurrence of “s” in an episode.
For state “s”, there can be multiple sums (returns) per episode.
The value of state “s” is then taken as the average of all such sums over all episodes in which “s” appears.
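To make the distinction concrete, below is a minimal Python sketch (an illustration, not the course's Matlab code) that computes the per-episode sums (returns) used by each method for a given state. It assumes an episode is represented as a list of (state, reward) pairs, the reward being the one received on leaving that state.

def episode_returns(episode, s, gamma=1.0, first_visit=True):
    """Return the list of sampled returns for state s from one episode.

    episode: list of (state, reward) pairs, where the reward is the one
             received on the transition out of that state.
    first_visit: if True, keep only the return from the first occurrence
                 of s; otherwise keep one return per occurrence of s.
    """
    # Compute returns backwards: G_t = r_t + gamma * G_{t+1}
    G = 0.0
    returns_at = {}          # time step -> return from that step
    for t in range(len(episode) - 1, -1, -1):
        _, r = episode[t]
        G = r + gamma * G
        returns_at[t] = G

    visits = [t for t, (state, _) in enumerate(episode) if state == s]
    if first_visit:
        visits = visits[:1]  # ignore reappearances of s
    return [returns_at[t] for t in visits]

# Hypothetical episode in which state "A" is visited twice:
ep = [("A", 3), ("B", 2), ("A", -4), ("B", 1), ("C", 5)]
print(episode_returns(ep, "A", first_visit=True))   # [7.0]        (3+2-4+1+5)
print(episode_returns(ep, "A", first_visit=False))  # [7.0, 2.0]   (second visit: -4+1+5)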
First-visit and every-visit MC prediction methods: example
(Worked example: returns for a given state computed from Episode 1 and Episode 2, comparing the first-visit and every-visit estimates; see figure.)
Pseudocode for first-visit MC prediction
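As a companion to the pseudocode named in the title, here is a minimal Python sketch of first-visit MC prediction. The helper generate_episode is a hypothetical function assumed to return one terminated episode as a list of (state, action, reward) triples, with each reward being the one received after taking that action.

from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V_pi by averaging first-visit returns over sampled episodes.

    generate_episode(policy) -> [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
    (hypothetical interface; any episode generator of this shape works)
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Iterate backwards so G accumulates the return from each step.
        for t in range(len(episode) - 1, -1, -1):
            s, _, r = episode[t]
            G = r + gamma * G
            if s not in states[:t]:          # first visit of s in this episode
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V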
Pseudocode for every-visit MC prediction
Incremental implementation (1)
(Source: derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Incremental implementation (2)
(Source: derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
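The derivation on the slides is not reproduced here; for reference, the standard incremental form of the sample-average update, which avoids storing all past returns, is

\[ V_{n+1}(s) = V_n(s) + \frac{1}{n}\bigl(G_n - V_n(s)\bigr), \]

and, with a constant step size \(\alpha \in (0,1]\) (useful for nonstationary problems),

\[ V_{n+1}(s) = V_n(s) + \alpha\bigl(G_n - V_n(s)\bigr), \]

where \(G_n\) is the n-th return observed from state s.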
Statistical properties of MC-based prediction (1)
(Source: W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Statistical properties of MC-based prediction (2)
(Source: W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
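As a brief summary of the standard results (the precise statements on the slides may differ): for first-visit MC, the returns averaged for a state are independent, identically distributed samples with mean \(v_\pi(s)\), so the estimate is unbiased and its variance shrinks with the number of collected returns N:

\[ \mathbb{E}\bigl[\hat V_N(s)\bigr] = v_\pi(s), \qquad \operatorname{Var}\bigl[\hat V_N(s)\bigr] = \frac{\operatorname{Var}[G(s)]}{N}, \]

i.e., the standard deviation of the estimate falls as \(1/\sqrt{N}\). Every-visit MC is biased for a finite number of episodes but remains consistent.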
Backup diagrams: DP vs MC
Fig.: Backup diagrams for DP (left) and MC (right) prediction: shallow one-step backups with bootstrapping vs. deep backups over full episodes.
MC-based prediction example: forest tree MDP (1)
(Source: W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
MC-based prediction example: forest tree MDP (2)
(Source: W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
Fig.: State-value estimate of the forest tree MDP's initial state using MC-based prediction, as a function of the number of evaluated episodes (mean and standard deviation calculated over 2000 independent runs).
MC-based prediction example: blackjack game (1)
MC-based prediction example: blackjack game (2)
Card  | Value
J/Q/K | 10
2-10  | face value
Ace   | 1 or 11
MC-based prediction example: blackjack game (3)
Game rules:
- If the player has 21 immediately (an ace and a 10-card), it is called a natural. He then wins, unless the dealer also has a natural, in which case the game is a draw.
- If the player does not have a natural, he can request additional cards one by one (“hits”) until he either stops (“sticks”) or exceeds 21 (“goes bust”). If he goes bust, he loses; if he sticks, it becomes the dealer's turn.
- The dealer hits or sticks according to a fixed strategy: he sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, the player wins.
- Otherwise, the outcome (win, lose, or draw) is determined by whose final sum is closer to 21.
MC-based prediction example: blackjack game (4)
Assumptions:
MC-based prediction example: blackjack game (5)
States: s = (player's current sum, dealer's showing card, whether the player holds a usable ace)
Actions: 2 actions: hit (= request an additional card) or stick (= end your turn).
Rewards: 3 rewards: +1 (win), -1 (lose), 0 (draw). All rewards are given at the end of a game; all rewards within a game are zero.
The total number of states is 200 because:
- the player's current sum is in the range 12-21 (sums below 12 are excluded, since in those cases the player would always hit; even with a sum of 11 he would still hit, because a newly drawn ace can be counted as 1) → 10 cases in total.
- the dealer's showing card is ace-10 (J, Q, K count as 10 as well) → 10 cases in total.
- the player may count an ace as 1 or 11 → 2 cases in total.
So, the total number of states is 10 × 10 × 2 = 200.
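To connect the state description above to code, here is a minimal Python sketch of the state space and of the evaluated policy that sticks only on 20 or 21 (illustrative names; the course's actual implementation is in the referenced Matlab files).

from itertools import product

# State: (player_sum, dealer_showing, usable_ace)
STATES = list(product(range(12, 22),    # player's current sum: 12..21 -> 10 cases
                      range(1, 11),     # dealer's showing card: ace(=1)..10 -> 10 cases
                      (False, True)))   # usable ace or not -> 2 cases
assert len(STATES) == 200               # matches the count derived above

HIT, STICK = 0, 1

def stick_on_20_policy(state):
    """Policy evaluated in the prediction example: stick only on 20 or 21."""
    player_sum, dealer_showing, usable_ace = state
    return STICK if player_sum >= 20 else HIT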
MC-based prediction example: blackjack game (6)
Results:
Fig.: Approximate state-value functions for the blackjack game under the player's policy of sticking only on 20 or 21, computed by first-visit Monte Carlo policy evaluation.
Note: Matlab files for this example are located in Blackjack\prediction
MC-based prediction example: blackjack game (7)
Remarks:
- Although we have complete knowledge of the environment in this task, it would not be easy to apply DP methods to compute the state-value function, because the distribution of next events (the environment's dynamics) is not easy to compute in this case.
- For example, suppose the player's sum is 14 and he chooses to stick. What is his probability of terminating with a reward of +1 as a function of the dealer's showing card? Such quantities are hard to compute, whereas generating sample games is easy.
Advantages of MC learning methods
- MC methods can learn from actual or simulated experience alone; they do not require knowledge of the environment's dynamics, i.e., the probability distributions of all possible transitions, which is the case in DP. This is a huge advantage in practice.
- MC methods do not bootstrap: the value estimate for one state does not build upon the estimate of any other state, as is the case in DP.
- The computational cost of estimating the value of a single state is independent of the total number of states. This can make Monte Carlo methods particularly attractive when one requires the value of only one or a subset of states: many sample episodes can be generated starting from the states of interest, averaging returns from only these states, ignoring all others.
Disadvantages of MC learning methods
(Figure: comparison of DP estimation vs. MC estimation.)
Monte Carlo estimation of action values (1)
- If a model of the environment is not available, it is particularly useful to estimate action values (values of state-action pairs) rather than state values.
- With a model, state values alone are sufficient to determine a policy: one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as we did in the lecture on DP (the one-step lookahead is recalled below).
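For reference, the one-step lookahead referred to above and its model-free counterpart based on action values are (standard forms, stated in the usual notation):

\[ \pi(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v(s')\bigr] \quad \text{(requires the model } p\text{)}, \qquad \pi(s) = \arg\max_a q(s, a) \quad \text{(no model required)}. \]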
Monte Carlo estimation of action values (2)
Monte Carlo estimation of action values (3)
Remarks:
Complications in Monte Carlo estimation of action values (1)
Complications in Monte Carlo estimation of action values (2)
A must: to estimate action values under a policy, exploration must be maintained so that all state-action pairs are visited (otherwise many pairs may never be visited and their values never estimated).
A remedy to complications in Monte Carlo estimation of action values
- One way to maintain exploration is to specify that episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start. This assumption is called “exploring starts”.
- Exploring starts guarantees that all state-action pairs will be visited an infinite number of times in the limit of an infinite number of episodes.
- However, when learning directly from actual interaction with the environment, exploring starts may not be possible to satisfy.
- The most common alternative is to use stochastic (soft) policies, in which the probability of selecting every action in every state is non-zero, e.g., an ε-greedy (ε-soft) method.
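As an illustration of such a soft policy, here is a minimal Python sketch of ε-greedy action selection; the dictionary Q, the actions list, and the function name are illustrative assumptions for this sketch.

import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick an action so that every action keeps probability >= epsilon/|A|.

    Q: dict mapping (state, action) -> estimated action value.
    """
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    # exploit: choose a greedy action (ties broken arbitrarily)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))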
Algorithm for Monte Carlo estimation of action values
Monte Carlo control with exploring starts (1)
Objective: to use Monte Carlo estimation methods to approximate optimal policies, under the assumption of exploring starts.
Monte Carlo control with exploring starts (2)
First, we will consider the MC version of classical policy iteration:

\[ \pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_*, \]

where E denotes a complete policy evaluation and I denotes a complete policy improvement.
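Written out (standard forms), the two alternating steps are: policy evaluation estimates the action values by averaging the sampled returns observed after visiting each state-action pair under the current policy, and policy improvement makes the policy greedy with respect to the current action-value estimate:

\[ q_{\pi_k}(s, a) \approx \operatorname{average}\bigl\{\text{returns following visits to } (s, a) \text{ under } \pi_k\bigr\}, \qquad \pi_{k+1}(s) = \arg\max_a q_{\pi_k}(s, a). \]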
Monte Carlo control with exploring starts (3)
Monte Carlo control with exploring starts (4)
Monte Carlo control with exploring starts (5)
Monte Carlo control with exploring starts (6)
(Figure: generalized policy iteration, alternating policy evaluation and policy improvement.)
Monte Carlo control with exploring starts (7)
- In Monte Carlo policy iteration, it is natural to alternate between policy evaluation and policy improvement on an episode-by-episode basis.
- After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode.
First-visit MC control with exploring starts algorithm
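As a companion to the algorithm named in the title, here is a minimal Python sketch of first-visit MC control with exploring starts. The helper generate_episode_es is a hypothetical function assumed to start each episode from a randomly chosen state-action pair (every pair with nonzero probability) and to return (state, action, reward) triples.

from collections import defaultdict
import random

def mc_control_exploring_starts(actions, generate_episode_es, num_episodes, gamma=1.0):
    """First-visit MC control with exploring starts (sketch).

    actions: list of available actions (assumed identical in every state).
    generate_episode_es(policy) -> [(S_0, A_0, R_1), (S_1, A_1, R_2), ...],
    with (S_0, A_0) chosen so that every pair has a nonzero start probability.
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = defaultdict(lambda: random.choice(actions))   # arbitrary initial policy

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)
        visited = [(s, a) for (s, a, _) in episode]
        G = 0.0
        # Iterate backwards so G accumulates the return from each step.
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G
            if (s, a) not in visited[:t]:                  # first visit of (s, a)
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # improve the policy at s: greedy with respect to the current Q
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q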
Every-visit MC control with exploring starts algorithm
MC control with exploring starts: blackjack game example
Fig.: Monte Carlo control with exploring starts (using the every-visit method) for the blackjack game (results with 5 million simulations).
Note: Matlab files for this example are located in Blackjack\control_with_es
References (utilized for preparation of lecture notes or Matlab code)