1 of 48

Lecture 7 - Monte Carlo Methods: Part 1 - Introduction to MC Methods, MC prediction and control

1

Instructor: Ercan Atam

Institute for Data Science & Artificial Intelligence

Course: DSAI 542-Reinforcement Learning

2 of 48

2

List of contents for this lecture

  • Monte Carlo idea and differences with dynamic programming

  • Monte Carlo prediction

  • Monte Carlo control

3 of 48

3

Relevant readings for this lecture

  • Chapter 5 of “Reinforcement Learning: An Introduction”, Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press, Cambridge, MA, 2018

4 of 48


4

Monte Carlo methods (1)

5 of 48

  • That is, we assume that experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected.

  • In Monte Carlo methods, value function estimates and policies are updated only upon the completion of an episode.

  • Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.

5

Monte Carlo methods (2)

6 of 48

  • Monte Carlo methods can be used in two ways:

      • Model-free: No model necessary and still attains optimality.

      • Simulated: Needs only a simulation, not a “full” model. Simulation model needs only to generate sample transitions, not the complete probability distribution.

6

Monte Carlo methods (3)

7 of 48

7

Monte Carlo methods vs dynamic programming

(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

Dynamic programming:

    • Model-based prediction and control

    • Planning for known MDPs

Monte Carlo methods:

    • Model-free prediction and control

    • Estimating value functions and optimizing policies for unknown MDPs

    • But: still assuming a finite MDP structure (or problems close to that)

    • They represent, in general, a broad class of computational algorithms relying on repeated random sampling to obtain numerical results
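For instance, the generic Monte Carlo idea of estimating an expectation purely from repeated random samples can be sketched as follows (a minimal Python illustration, unrelated to any particular MDP; all names below are chosen here for illustration):

import random

def mc_estimate_expectation(sample_fn, num_samples=10_000):
    # Estimate E[X] by averaging repeated random samples of X.
    total = 0.0
    for _ in range(num_samples):
        total += sample_fn()
    return total / num_samples

# Example: estimate the expected value of a fair six-sided die (true value 3.5).
print(mc_estimate_expectation(lambda: random.randint(1, 6)))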

8 of 48

8

Monte Carlo prediction (1)

MC prediction problem

9 of 48


9

Monte Carlo prediction (2)

10 of 48


10

Monte Carlo prediction (3)


11 of 48

11

First-visit and every-visit MC prediction methods

Definition: first-visit MC prediction method

In the first-visit MC prediction method, when state “s” is visited for the first time in an episode, you start summing the rewards from that point until the end of the episode.

If state “s” appears again later in the same episode, the re-appearance is ignored and no new sum is started. For state “s”, there is at most one sum (return) per episode.

The value of state “s” is then taken as the average of these returns over all episodes in which “s” appears.

Definition: every-visit MC prediction method

In the every-visit MC prediction method, a new sum (return) is started at every appearance of “s” in an episode.

For state “s”, there can be multiple returns per episode.

The value of state “s” is then taken as the average of all these returns over all episodes in which “s” appears.
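As a small hypothetical illustration of the difference (the episode below is invented for this note, not taken from the course material): consider an undiscounted episode s1 → s2 → s1 → terminal with rewards 2, 3 and 4 received on the three transitions. The return following the first visit to s1 is 2 + 3 + 4 = 9, and the return following its second visit is 4. From this episode, first-visit MC records the single return 9 for s1, whereas every-visit MC records both 9 and 4 (and would average them, giving 6.5).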

12 of 48

12

 

 

 

 

 

First-visit and every-visit MC prediction methods: example

(Figure: two sample episodes, Episode 1 and Episode 2, with the corresponding first-visit and every-visit return calculations.)

13 of 48

13

Pseudocode for first-visit MC prediction
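A minimal Python sketch of first-visit MC prediction (an illustration, not the course's Matlab code; it assumes each episode is given as a list of (state, reward) pairs, where the reward is the one received on leaving that state):

from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    # Average, for each state, the return following its first visit in each episode.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:  # episode: [(s0, r1), (s1, r2), ...]
        # Walk the episode backwards to accumulate discounted returns.
        g = 0.0
        returns_after = []
        for state, reward in reversed(episode):
            g = gamma * g + reward
            returns_after.append((state, g))
        returns_after.reverse()  # back to chronological order

        visited = set()
        for state, g in returns_after:
            if state not in visited:  # count only the first visit of each state
                visited.add(state)
                returns_sum[state] += g
                returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}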

14 of 48

14

Pseudocode for every-visit MC prediction
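For comparison, a sketch of every-visit MC prediction under the same assumptions as above; the only change is that every occurrence of a state contributes its return:

from collections import defaultdict

def every_visit_mc_prediction(episodes, gamma=1.0):
    # Average the returns following every visit to each state.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:  # episode: [(s0, r1), (s1, r2), ...]
        g = 0.0
        for state, reward in reversed(episode):
            g = gamma * g + reward        # return following this visit
            returns_sum[state] += g
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}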

15 of 48

15

Incremental implementation (1)

 

(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
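As a concrete reference for such an incremental update (the standard running-mean form, given here as a summary and not necessarily in the exact notation of the slide): after observing the return $G_t$ for the visited state $S_t$,

$N(S_t) \leftarrow N(S_t) + 1$
$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$

and for non-stationary problems a constant step size $\alpha$ can be used instead: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$.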

16 of 48

16

Incremental implementation (2)

(Source, derivative of W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

17 of 48

17

Statistical properties of MC-based prediction (1)

 

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)
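As background on these statistical properties (standard facts about sample means, stated here as a summary): if $G^{(1)}, \dots, G^{(n)}$ are the returns collected for a state $s$, the MC estimate is the sample mean $\hat{v}(s) = \frac{1}{n}\sum_{i=1}^{n} G^{(i)}$. For first-visit MC these returns are independent samples with mean $v_\pi(s)$, so the estimate is unbiased and its standard error shrinks as $1/\sqrt{n}$.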

18 of 48

18

Statistical properties of MC-based prediction (2)

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

  • State-value estimates for each state are independent.

  • One estimate does not rely on the estimates of other states (no bootstrapping, in contrast to DP).

  • This makes MC particularly attractive when one requires state-value knowledge of only one or a few states.

    • Hence, one can generate episodes starting only from the state(s) of interest.

19 of 48

19

  • Whereas the DP diagram shows all possible transitions, the Monte Carlo diagram shows only those sampled on the episodes.

  • MC goes all the way to the end of episodes (whereas DP includes only one-step transitions).

Backup diagrams: DP vs MC

Fig.: Backup diagrams for DP (left) and MC (right) prediction: shallow one-step backups with bootstrapping vs. deep backups over full episodes.

20 of 48

20

MC-based prediction example: forest tree MDP (1)

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

21 of 48

21

MC-based prediction example: forest tree MDP (2)

(Source, W. Kirchgässner, M. Schenke, O. Wallscheid and D. Weber, RL Course Material, Paderborn University, 2020)

Fig.: State-value estimate of the forest tree MDP's initial state using MC-based prediction, as a function of the number of episodes evaluated (mean and standard deviation calculated from 2000 independent runs)

22 of 48

22

  • A card game between a “dealer” and a “player”.

  • Objective: Obtain cards the sum of whose numerical values is as great as possible without exceeding 21.

MC-based prediction example: blackjack game (1)

23 of 48

23

Card     Value
J/Q/K    10
2-10     face value
Ace      1 or 11

MC-based prediction example: blackjack game (2)

24 of 48

24

  • The game begins with two cards dealt to both dealer and player. One of the dealer’s cards is face up and the other is face down.

  • If the player has 21 immediately (an ace and a 10-card), it is called a “natural”. The player then wins unless the dealer also has a natural, in which case the game is a draw.

  • If the player does not have a natural, then he can request additional cards one by one (“hits”) until he either stops (“sticks”) or exceeds 21 (“goes bust”).

  • If the player goes bust, he loses; if he sticks, then it becomes the dealer’s turn.

  • The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, and hits otherwise.

  • If the dealer goes bust, then the player wins; otherwise, the outcome (win, lose, or draw) is determined by whose final sum is closer to 21.

  • If the player holds an ace that he could count as 11 without going bust, then the ace is said to be “usable”.

Game rules:

MC-based prediction example: blackjack game (3)

25 of 48

  •  

25

Assumptions:

MC-based prediction example: blackjack game (4)

26 of 48

26

States: s = (player’s current total, dealer’s showing card, whether the player holds a usable ace)

Total number of states is 200 because:

- the player’s current sum is in the range [12-21] (we exclude sums lower than 12, since in those scenarios the player would always hit: even with a current sum of 11, hitting cannot cause a bust, because the highest card counts as 10 and a drawn ace can be counted as 1) 🡺 in total 10 cases.

- the dealer’s showing card is in [ace-10] (J, Q, K count as 10 as well) 🡺 in total 10 cases.

- the player counts an ace as either 1 or 11 (usable ace or not) 🡺 in total 2 cases.

So, the total number of states is 10 x 10 x 2 = 200.

Actions: 2 actions: hit (= request an additional card) or stick (= end your turn).

Rewards: 3 rewards: +1 (win), -1 (lose), 0 (draw). The reward is given at the end of a game; all rewards within a game are zero.
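A minimal Python sketch of this state and action space (an illustration only; the names below are chosen here and are not taken from the course's Matlab files):

from itertools import product

# State: (player's current sum, dealer's showing card, usable-ace flag)
PLAYER_SUMS = range(12, 22)     # 12..21 -> 10 cases
DEALER_CARDS = range(1, 11)     # ace(=1)..10 -> 10 cases (J, Q, K count as 10)
USABLE_ACE = (False, True)      # 2 cases

STATES = list(product(PLAYER_SUMS, DEALER_CARDS, USABLE_ACE))
assert len(STATES) == 200       # 10 x 10 x 2

ACTIONS = ("hit", "stick")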

MC-based prediction example: blackjack game (5)

27 of 48

27

Results:

Fig: Approximate state-value functions of the blackjack game for the player’s policy of sticking only on 20 or 21, computed by first-visit Monte Carlo policy evaluation.

Note: Matlab files for this example are located in Blackjack\prediction

MC-based prediction example: blackjack game (6)

28 of 48

28

Remarks:

  • Although we have complete knowledge of the environment in the blackjack task, it would not be easy to apply DP methods to compute the state-value function, because the distribution of next events (the environment’s dynamics) is not easy to compute in this case.

  • For example, suppose the player’s sum is 14 and he chooses to stick. What is his probability of terminating with a reward of +1 as a function of the dealer’s showing card?

  • In contrast, generating the sample games required by Monte Carlo methods is easy in this case.

MC-based prediction example: blackjack game (7)

29 of 48

29

  • Monte Carlo methods do not require complete knowledge of the environment or of all possible transitions, unlike DP. This is a huge advantage in practice.

  • In Monte Carlo methods, the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP.

  • The computational expense of estimating the value of a single state is independent of the number of states. This can make Monte Carlo methods particularly attractive when one requires the value of only one or a subset of states.

  • One can generate many sample episodes starting from the states of interest, averaging returns from only these states and ignoring all others.

Advantages of MC learning methods

30 of 48

30

  • MC goes all the way to the end of the episode (whereas DP includes only one-step transitions).

  • When episodes are long, learning can be slow in Monte Carlo methods:

    • One has to wait until an episode ends before learning happens.

    • Returns can have high variance.

Disadvantages of MC learning methods

(Figure: comparison of DP estimation and MC estimation.)

31 of 48

31

  • With a model, state values alone are sufficient to determine a policy (improve an existing policy): one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as we did in the lecture on DP:
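The one-step lookahead referred to here is the standard greedy improvement step with a known model $p(s', r \mid s, a)$:

$\pi(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma V(s') \,\bigr]$

Without a model, however, state values alone are not sufficient, which is why MC control estimates action values $q_\pi(s, a)$ and improves the policy via $\pi(s) = \arg\max_a Q(s, a)$ instead.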

Monte Carlo estimation of action values (1)

32 of 48

32

 

Monte Carlo estimation of action values (2)

33 of 48

33

 

Remarks:

Monte Carlo estimation of action values (3)

34 of 48

34

 

Complications in Monte Carlo estimation of action values (1)

35 of 48

35

Complications in Monte Carlo estimation of action values (2)

A must:

  • For policy evaluation to work for action values, we must ensure continual exploration: instead of just one action at a state, all other possible actions at that state must be selected as well!

36 of 48

36

  • One option for assuring continual exploration is to specify that episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start. This assumption is called “exploring starts”.

  • The assumption of exploring starts guarantees that all state–action pairs will be visited an infinite number of times in the limit of an infinite number of episodes.

  • But the assumption of exploring starts cannot always be satisfied. For example, when learning directly from interaction with the environment, exploring starts may not be possible to arrange.

  • A more common alternative is to consider only stochastic policies in which the probability of selecting every action in every state is nonzero.

    • We will discuss two important variants of this approach later on.

  • For now, we retain the assumption of exploring starts and complete the presentation of a full MC control method (a minimal sketch of exploring-starts episode generation is given below).

A remedy to complications in Monte Carlo estimation of action values
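A minimal sketch of generating an episode under exploring starts (assuming a generic episodic environment object with reset_to(state) and step(action) methods; these names are illustrative, not from the course material):

import random

def generate_episode_with_exploring_starts(env, policy, states, actions):
    # Start from a uniformly random state-action pair (exploring starts),
    # then follow the given policy until the episode terminates.
    state = random.choice(states)
    action = random.choice(actions)   # every (s, a) pair has a nonzero start probability
    env.reset_to(state)

    episode = []                      # list of (state, action, reward)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        if not done:
            state = next_state
            action = policy(state)    # after the start, actions come from the policy
    return episode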

37 of 48

37

Algorithm for Monte Carlo estimation of action values

 

38 of 48

38

Objective: To use Monte Carlo estimation methods for approximation of optimal policies under the assumption of exploring starts.

Monte Carlo control with exploring starts (1)

39 of 48

39

First, we will consider the MC version of classical policy iteration:

Monte Carlo control with exploring starts (2)

$\pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_{\pi_*}$

$\xrightarrow{E}$ : a complete policy evaluation

$\xrightarrow{I}$ : a complete policy improvement

40 of 48

40

 

Monte Carlo control with exploring starts (3)

41 of 48

41

 

Monte Carlo control with exploring starts (4)

42 of 48

42

  • Note that we made two unlikely assumptions in order to easily obtain the guarantee of convergence for the Monte Carlo methods:

      • Assumption 1: episodes have exploring starts.

      • Assumption 2: policy evaluation could be done with an infinite number of episodes.

  • To obtain a practical algorithm, we will have to remove both assumptions. We postpone the removal of the first assumption (exploring starts) until Lecture 8; i.e., we keep Assumption 1 for a while.

Monte Carlo control with exploring starts (5)

43 of 48

43

 

Monte Carlo control with exploring starts (6)

(Figure: the cycle of alternating policy evaluation and policy improvement.)

44 of 48

44

  • For Monte Carlo-based policy iteration, it is natural to alternate between policy evaluation and policy improvement on an episode-by-episode basis.

  • After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode.

  • A complete algorithm based on this principle is given next.

Monte Carlo control with exploring starts (7)

45 of 48

45

First-visit MC control with exploring starts algorithm
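A minimal Python sketch of first-visit MC control with exploring starts (an illustration in the spirit of the algorithm, not the course's Matlab implementation; it reuses the hypothetical generate_episode_with_exploring_starts sketched earlier):

from collections import defaultdict
import random

def mc_control_exploring_starts(env, states, actions, num_episodes, gamma=1.0):
    # Alternate, episode by episode, between evaluating Q and making the
    # policy greedy with respect to the current Q estimates.
    Q = defaultdict(float)                    # Q[(s, a)]
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}   # arbitrary initial policy

    for _ in range(num_episodes):
        episode = generate_episode_with_exploring_starts(
            env, lambda s: policy[s], states, actions)

        # Walk the episode backwards; because later assignments correspond to
        # earlier time steps, each pair ends up with its first-visit return.
        g = 0.0
        first_visit_return = {}
        for state, action, reward in reversed(episode):
            g = gamma * g + reward
            first_visit_return[(state, action)] = g

        for (state, action), g in first_visit_return.items():
            returns_count[(state, action)] += 1
            # Incremental average of the observed returns (policy evaluation).
            Q[(state, action)] += (g - Q[(state, action)]) / returns_count[(state, action)]
            # Policy improvement: greedy with respect to Q at the visited state.
            policy[state] = max(actions, key=lambda a: Q[(state, a)])

    return policy, Q

An every-visit variant (next slide) would simply update Q for every occurrence of a state–action pair in the episode, rather than only for its first visit.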

46 of 48

46

Every-visit MC control with exploring starts algorithm

47 of 48

47

Fig.: Monte Carlo control with exploring starts (using the every-visit method) for the blackjack game (results with 5 million simulations)

Note: Matlab files for this example are located in Blackjack\control_with_es

MC control with exploring starts: blackjack game example

48 of 48

References (utilized for preparation of lecture notes or Matlab code)

48