
Lecture 2: Multi-Armed Bandits


Instructor: Ercan Atam

Institute for Data Science & Artificial Intelligence

Course: DSAI 542-Reinforcement Learning


List of contents for this lecture

 


Relevant readings for this lecture

  • Chapter 2 of “Reinforcement Learning: An Introduction”, Richard S. Sutton and Andrew G. Barto, Second Edition, MIT Press, Cambridge, MA, 2018


k-armed bandit problem (1)

  • Consider the following learning problem. You are faced repeatedly with a choice among

k different options, or actions. After each choice, you receive a numerical reward chosen

from a stationary probability distribution that depends on the action you selected.

  • Your objective is to maximize the expected total reward over some time period, for example,

over 1000 action selections, or time steps.

  • This is the original form of the “k-armed bandit problem”, so named by analogy to a slot

machine, or “one-armed bandit,” except that it has k levers instead of one.
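The setup described above can be sketched in code. The following is an illustrative sketch (class and variable names are my own, not from the lecture): each of the k arms pays a reward drawn from a stationary Gaussian distribution whose mean is that arm's true value.

```python
import random

class KArmedBandit:
    """A stationary k-armed bandit: each arm pays a Gaussian reward."""
    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        # True action values q*(a), drawn once and then fixed (stationary).
        self.q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]
        self._rng = rng

    def pull(self, a):
        # Reward depends only on the chosen arm: R ~ N(q*(a), 1).
        return self._rng.gauss(self.q_star[a], 1.0)

bandit = KArmedBandit(k=10)
best_arm = max(range(10), key=lambda a: bandit.q_star[a])
```

Maximizing expected total reward over, say, 1000 pulls then amounts to identifying and repeatedly pulling the arm with the highest q*(a).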


k-armed bandit problem (2)

 


k-armed bandit problem (3)

 


Examples of k-armed bandit problems

  • Design of experiments (clinical trials)

  • Online ad placement

  • Web page personalization

  • Games

  • Networks (packet routing)


Action-value methods

  • Recall that the true value of an action is the mean (expected) reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received:
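As a minimal sketch of this sample-average estimate (the equation itself did not survive extraction; the function name is illustrative): the estimate for an action is simply the mean of the rewards received so far when that action was taken.

```python
def sample_average(rewards):
    """Estimate q*(a) by averaging the rewards actually received for action a."""
    if not rewards:       # no samples yet: fall back to a default value, e.g. 0
        return 0.0
    return sum(rewards) / len(rewards)

# Rewards observed on three pulls of one arm:
q_hat = sample_average([1.0, 0.0, 2.0])
```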

 


 

 

 

  • A simple way to balance exploration and exploitation.

  • In the limit, every action will be sampled an infinite number of times.

  • Hence the sample-average estimates will converge to the true action values.
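One standard method with these properties, covered in Chapter 2 of the assigned reading, is ε-greedy action selection: with probability ε pick a uniformly random arm, otherwise pick the arm with the highest current estimate. A minimal sketch:

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """With probability epsilon explore (random arm), else exploit (greedy arm)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))                  # explore: uniform over arms
    return max(range(len(Q)), key=Q.__getitem__)      # exploit: best estimate

action = epsilon_greedy([0.2, 1.5, -0.3], epsilon=0.1)
```

With ε = 0 this reduces to the purely greedy method; small positive ε guarantees every arm keeps being sampled.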


The 10-armed testbed (1)

 


The 10-armed testbed (2)

 


The 10-armed testbed (3)

  • The upper graph shows the increase in expected reward with experience.

  • The greedy method improved slightly faster than the other methods at the very beginning, but then leveled off at a lower level.

  • The greedy method achieved a reward-per-step of only about 1, compared with the best possible of about 1.54 on this testbed.

  • The greedy method performed significantly worse in the long run because it often got stuck performing suboptimal actions.


The 10-armed testbed (4)

 


 

 


Incremental implementation (1)

 


Incremental implementation (2)


Incremental implementation (3)

The incremental form has the following structure:

NewEstimate ← OldEstimate + StepSize × [Target − OldEstimate]
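With StepSize = 1/n, this update reproduces the plain sample average exactly, without storing past rewards. A quick sketch (illustrative names):

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

rewards = [2.0, 4.0, 9.0]
q, n = 0.0, 0
for r in rewards:
    n += 1
    q = incremental_update(q, r, 1.0 / n)   # step size 1/n -> running mean
# q now equals the plain average of the three rewards.
```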


A simple bandit algorithm
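The slide's pseudocode did not survive extraction. In the assigned reading (Sutton & Barto, Ch. 2), the simple bandit algorithm combines ε-greedy selection with incremental sample-average updates; a hedged Python sketch under that assumption:

```python
import random

def simple_bandit(q_star, epsilon, steps, seed=0):
    """epsilon-greedy bandit with incremental sample-average updates."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [0.0] * k          # value estimates, initialized to 0
    N = [0] * k            # pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # explore
        else:
            a = max(range(k), key=Q.__getitem__)       # exploit
        r = rng.gauss(q_star[a], 1.0)                  # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental average
        total += r
    return Q, N, total / steps

Q, N, avg = simple_bandit([0.0, 1.0, 0.5], epsilon=0.1, steps=5000)
```

After enough steps the best arm (here arm 1) dominates the pull counts and its estimate approaches its true value.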

 


Nonstationary problems (1)

 


Nonstationary problems (2)

Constant step-size case

  • With a constant step size α, the estimate becomes an exponential recency-weighted average of past rewards and the initial estimate:

Q_{n+1} = (1 − α)^n Q_1 + Σ_{i=1}^{n} α (1 − α)^{n−i} R_i

  • We call this a “weighted average” because the sum of the weights, (1 − α)^n + Σ_{i=1}^{n} α (1 − α)^{n−i}, equals 1.


Convergence (1)

 


Convergence (2)

 


Optimistic initial values (1)

 


Optimistic initial values (2)

Idea: set initial action values to an optimistic value, a value much higher than the typical range of values to be encountered.

Consequence: this trick leads to a simple way to encourage exploration!
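Why this works: whichever arm is tried first returns a reward below the optimistic estimate, so the learner is "disappointed" and switches, visiting every arm early even under purely greedy selection. An illustrative sketch (names are my own):

```python
import random

def greedy_optimistic(q_star, q0, steps, alpha=0.1, seed=0):
    """Purely greedy selection; exploration comes only from optimistic Q1 = q0."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [q0] * k                        # optimistic initial values
    visited = set()
    for _ in range(steps):
        a = max(range(k), key=Q.__getitem__)
        visited.add(a)
        r = rng.gauss(q_star[a], 1.0)
        Q[a] += alpha * (r - Q[a])      # constant step size
    return visited

# With Q1 = +5, every arm gets tried early even though selection is greedy.
visited = greedy_optimistic([0.0, 1.0, 0.5], q0=5.0, steps=50)
```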


Optimistic initial values (3)

 


Optimistic initial values (4)

 


Optimistic initial values (5)

  • Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time.

  • We call this technique for encouraging exploration “optimistic initial values”.

  • We regard it as a simple trick that can be quite effective on stationary problems, but it is far from

being a generally useful approach to encouraging exploration.

  • For example, it is not well suited to nonstationary problems because its drive for exploration is inherently temporary.
    • If the task changes, creating a renewed need for exploration, this method cannot help.


Action selection based on upper-confidence-bound (1)

 


Action selection based on upper-confidence-bound (2)

 


Action selection based on upper-confidence-bound (3)

 


Action selection based on upper-confidence-bound (4)

 


Action selection based on upper-confidence-bound (5)
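The formulas on these slides did not survive extraction. The UCB rule from the assigned reading selects A_t = argmax_a [ Q_t(a) + c·sqrt(ln t / N_t(a)) ], treating never-tried arms as maximizing. A hedged sketch of that rule:

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a of Q[a] + c*sqrt(ln t / N[a]); untried arms go first."""
    best_a, best_score = 0, float("-inf")
    for a in range(len(Q)):
        if N[a] == 0:
            return a                    # untried arm: treated as maximizing
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```

The square-root term is an uncertainty bonus: it shrinks each time an arm is pulled and grows slowly (via ln t) while the arm is ignored, so every arm is eventually revisited.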

 


Gradient bandit algorithms (1)

 


Gradient bandit algorithms (2)

 

 

 

(see the full derivation in the textbook)
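The derivation in the textbook yields the preference update H_{t+1}(a) = H_t(a) + α (R_t − baseline) (1{a = A_t} − π_t(a)), where π_t is the softmax of the preferences. A one-step sketch (function names are illustrative):

```python
import math

def softmax(H):
    """Action probabilities pi(a) from preferences H(a)."""
    m = max(H)
    e = [math.exp(h - m) for h in H]   # subtract max for numerical stability
    s = sum(e)
    return [x / s for x in e]

def gradient_bandit_step(H, a, reward, baseline, alpha=0.1):
    """H_{t+1}(a) = H_t(a) + alpha*(R - baseline)*(1{a=A_t} - pi(a))."""
    pi = softmax(H)
    return [h + alpha * (reward - baseline) * ((i == a) - pi[i])
            for i, h in enumerate(H)]

H = [0.0, 0.0, 0.0]
H = gradient_bandit_step(H, a=1, reward=2.0, baseline=0.0)  # above baseline
```

A reward above the baseline raises the chosen action's preference and lowers all others; below the baseline, the opposite.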


Gradient bandit algorithms (3)

 

  • Below are the results with the gradient bandit algorithm on a variant of the 10-armed testbed in which the true expected rewards were selected according to a normal distribution with a mean of +4 instead of zero (and with unit variance as before).

 


 

RL:

 

MAB:

 

  • This difference comes from the fact that, in the MAB setting, the time steps involve entirely separate, independent interactions.


Associative search (Contextual bandits) (1)

  • So far we have considered nonassociative tasks: there are no different situations, only a single state.

    • When stationary: find the single best action.

    • When nonstationary: track the best action as it changes over time.

  • In full RL (or in short, RL), we generally face different situations: find the best action for each situation (= state).

  • Contextual bandits (associative search tasks) sit between the “k-armed bandit problem” and “full RL”.


Associative search (Contextual bandits) (2)

  • This new setting is like full RL in that it requires learning a policy to handle each situation, but like the k-armed bandit problem in that each action affects only the immediate reward; there is no connection between states, hence no state transitions.

  • If actions affect the next state as well as the reward, then the problem is a full RL problem.

Example: showing customers different webpage content depending on whether it is “work” or “non-work (at-home)” hours.
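The webpage example above can be sketched as one independent bandit per context (the context names and reward values below are illustrative): the environment supplies the context, and the learner keeps a separate value table for each one.

```python
import random

def contextual_bandit(q_star_by_context, epsilon, steps, seed=0):
    """Separate epsilon-greedy value table per context; context is given, not chosen."""
    rng = random.Random(seed)
    Q = {c: [0.0] * len(q) for c, q in q_star_by_context.items()}
    N = {c: [0] * len(q) for c, q in q_star_by_context.items()}
    contexts = list(q_star_by_context)
    for _ in range(steps):
        c = rng.choice(contexts)               # environment supplies the context
        q_star = q_star_by_context[c]
        if rng.random() < epsilon:
            a = rng.randrange(len(q_star))
        else:
            a = max(range(len(q_star)), key=Q[c].__getitem__)
        r = rng.gauss(q_star[a], 1.0)          # action affects only this reward
        N[c][a] += 1
        Q[c][a] += (r - Q[c][a]) / N[c][a]     # incremental sample average
    return Q, N

# "work" hours favour page 0; "home" hours favour page 1.
Q, N = contextual_bandit({"work": [1.0, 0.0], "home": [0.0, 1.0]},
                         epsilon=0.1, steps=4000)
```

Because states do not influence each other, the learner simply runs one k-armed bandit per context and learns the best page for each.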


Which action selection method is the best for the k-armed bandits?

Figure: A parameter study of the various bandit algorithms presented in this lecture on the 10-armed testbed. Each point is the average reward obtained over 1000 steps with a particular algorithm at a particular setting of its parameter.


References (utilized for the preparation of lecture notes or Matlab code)
