Lecture 2: Multi-Armed Bandits
Instructor: Ercan Atam
Institute for Data Science & Artificial Intelligence
Course: DSAI 542-Reinforcement Learning
List of contents for this lecture
Relevant readings for this lecture
k-armed bandit problem (1)
Consider the following learning problem: you are faced repeatedly with a choice among k different options, or actions. After each choice, you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps. The name comes from the analogy to a slot machine, or “one-armed bandit,” except that it has k levers instead of one.
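As a concrete sketch (the class name, Gaussian reward structure, and unit variance are illustrative assumptions, matching the testbed used later in this lecture), a stationary k-armed bandit can be simulated as:

```python
import random

class KArmedBandit:
    """A stationary k-armed bandit: each of the k levers pays a reward
    drawn from a fixed (hidden) probability distribution."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        # True action values q*(a), unknown to the learner.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]
        self.k = k

    def pull(self, action):
        # Reward ~ N(q*(a), 1): stationary, depends only on the chosen action.
        return self.rng.gauss(self.q_star[action], 1.0)
```

The learner interacts with such an object only through `pull`, never seeing `q_star` directly.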
k-armed bandit problem (2)
k-armed bandit problem (3)
Examples of k-armed bandit problems
Action-value methods
where Q_t(a) is the sample-average estimate: the sum of rewards received when action a was taken prior to time t, divided by the number of times a was taken prior to t.
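A minimal sketch of these action-value methods (function names are illustrative): estimate Q by sample averages and select actions epsilon-greedily.

```python
import random

def sample_average(rewards):
    """Q_t(a): the average of rewards received for an action so far (0 if untried)."""
    return sum(rewards) / len(rewards) if rewards else 0.0

def epsilon_greedy(Q, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick a greedy action, breaking ties at random (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))
    best = max(Q)
    return rng.choice([a for a, q in enumerate(Q) if q == best])
```

With epsilon = 0 this reduces to the pure greedy method discussed in the testbed experiments.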
The 10-armed testbed (1)
The 10-armed testbed (2)
The 10-armed testbed (3)
The greedy method performed significantly worse in the long run because it often got stuck performing suboptimal actions.
The 10-armed testbed (4)
Incremental implementation (1)
Incremental implementation (2)
Incremental implementation (3)
The incremental form has the following structure:
NewEstimate ← OldEstimate + StepSize × [Target − OldEstimate]
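In code this is a one-liner; choosing StepSize = 1/n reproduces the sample average incrementally (a sketch):

```python
def update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# With step_size = 1/n, repeated updates reproduce the sample average:
Q, rewards = 0.0, [2.0, 4.0, 6.0]
for n, r in enumerate(rewards, start=1):
    Q = update(Q, r, 1.0 / n)
# Q is now 4.0, the mean of the rewards.
```

This avoids storing the full reward history: only the current estimate and the count are kept.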
A simple bandit algorithm
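Putting the pieces together, the simple bandit algorithm (epsilon-greedy selection with incremental sample-average updates) can be sketched as follows; the function name and Gaussian testbed rewards are illustrative assumptions:

```python
import random

def simple_bandit(q_star, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy bandit with incremental sample-average estimates.
    q_star holds the true (hidden) action values; rewards ~ N(q*(a), 1)."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [0.0] * k          # value estimates, initialized to 0
    N = [0] * k            # action counts
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                     # explore
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit
        r = rng.gauss(q_star[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental average
        rewards.append(r)
    return Q, N, rewards
```

Averaging `rewards` across many independently drawn tasks reproduces the learning curves of the 10-armed testbed.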
Nonstationary problems (1)
Nonstationary problems (2)
Constant step-size case
Convergence (1)
Convergence (2)
Optimistic initial values (1)
Optimistic initial values (2)
Idea: set initial action values to an optimistic value, a value much higher than the typical range of values to be encountered.
Consequence: this trick leads to a simple way to encourage exploration!
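A small demonstration of the effect (true values, step size, and seed are arbitrary choices): with estimates initialized to +5 while rewards are near 0, a purely greedy learner is disappointed by whichever arm it tries and therefore ends up sampling every arm.

```python
import random

def optimistic_greedy(q_star, q_init=5.0, alpha=0.1, steps=200, seed=1):
    """Pure greedy selection with optimistically initialized estimates;
    returns the set of arms tried."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [q_init] * k
    tried = set()
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])   # greedy, no epsilon
        tried.add(a)
        r = rng.gauss(q_star[a], 1.0)
        Q[a] += alpha * (r - Q[a])              # constant step size
        # Each reward near 0 drags Q[a] below 5, so untried arms look best next.
    return tried
```

All arms get explored despite purely greedy selection, because every untried arm still carries the optimistic estimate.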
Optimistic initial values (3)
Optimistic initial values (4)
Optimistic initial values (5)
However, any method that relies on the initial conditions in this way matters only temporarily, so optimistic initialization is far from being a generally useful approach to encouraging exploration (for example, it cannot help with nonstationary problems).
Action selection based on upper-confidence-bound (1)
Action selection based on upper-confidence-bound (2)
Action selection based on upper-confidence-bound (3)
Action selection based on upper-confidence-bound (4)
Action selection based on upper-confidence-bound (5)
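The UCB rule selects A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]. A minimal sketch (untried actions are treated as maximizing, since N_t(a) = 0 makes the uncertainty term unbounded):

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Upper-confidence-bound action selection at time step t >= 1.
    Q: value estimates; N: selection counts; c: exploration degree."""
    for a, n in enumerate(N):
        if n == 0:
            return a  # N(a) = 0: treat a as a maximizing action
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```

The square-root term shrinks each time an action is taken and grows (via ln t) while it is not, so every action is eventually re-tried, but rarely if its estimate is low.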
Gradient bandit algorithms (1)
Gradient bandit algorithms (2)
(see the full derivation in the textbook)
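The resulting update on soft-max action preferences H_t(a), with a reward baseline (typically the average reward so far), can be sketched as:

```python
import math

def softmax(H):
    """pi(a) = exp(H(a)) / sum_b exp(H(b)), numerically stabilized."""
    m = max(H)
    e = [math.exp(h - m) for h in H]
    s = sum(e)
    return [x / s for x in e]

def gradient_bandit_update(H, action, reward, baseline, alpha):
    """H(a) += alpha*(R - baseline)*(1 - pi(a)) for the taken action,
    H(b) -= alpha*(R - baseline)*pi(b)       for all other actions."""
    pi = softmax(H)
    return [h + alpha * (reward - baseline)
                * ((1.0 if a == action else 0.0) - pi[a])
            for a, h in enumerate(H)]
```

A reward above the baseline raises the preference for the taken action and lowers the others; below the baseline, the opposite.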
Gradient bandit algorithms (3)
RL: the agent acts in multiple situations (states), and actions influence future situations as well as rewards.
MAB: there is a single situation, and each action affects only the immediate reward.
Associative search (Contextual bandits) (1)
Associative search (Contextual bandits) (2)
Example: showing customers different webpage content depending on whether it is “work” or “non-work (at home)” hours.
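In code, a contextual bandit can simply maintain one value table per context and run any bandit method within the current context's table; a minimal sketch (context keys and values are illustrative):

```python
import random

def contextual_epsilon_greedy(Q_by_context, context, epsilon, rng=random):
    """Associative search: look up the action-value table for the current
    context (e.g. 'work' vs 'home' hours) and act epsilon-greedily in it."""
    Q = Q_by_context[context]
    if rng.random() < epsilon:
        return rng.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])
```

Each context learns its own best action, which is exactly what distinguishes associative search from the single-situation bandit.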
Which action selection method is the best for the k-armed bandits?
Figure: A parameter study of the various bandit algorithms presented in this lecture for the 10-armed testbed. Each point is the average reward obtained over 1000 steps with a particular algorithm at a particular setting of its parameter.
The best for the 10-armed testbed
References (utilized for preparation of lecture notes or Matlab code)