2 of 32

RL learning�

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.

• Evaluates actions rather instructs by giving correct actions

• Pure evaluative feedback depends totally on the action taken.

Pure instructive feedback depends not at all on the action taken.

• Evaluative feedback indicates how good the action is, but not if it is the best or worst action possible.

• Supervised learning is instructive; optimization is evaluative.

Evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.

3 of 32

The evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. This nonassociative setting is the one in which most prior work involving evaluative feedback has been done, and it avoids much of the complexity of the full reinforcement learning problem. Studying this case enables us to see most clearly how evaluative feedback differs from, and yet can be combined with, instructive feedback.

6 of 32

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions.
When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions.
If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value. Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.

7 of 32

Balancing exploitation and exploration is a key challenge in the multi-armed bandit problem. If the agent focuses too much on exploitation, it may get stuck on a sub-optimal arm and miss out on the best arm. Conversely, if the agent focuses too much on exploration, it may waste time on low-reward arms and miss out on the cumulative reward from the best arm.

In summary, exploitation and exploration are two competing strategies for selecting arms in the multi-armed bandit problem. Balancing these strategies is essential to achieve the best overall performance.

19 of 32

Upper Confidence Bounds(UCB)�

Upper Confidence Bounds (UCB) is a popular algorithm for solving the multi-armed bandit problem.

It is similar to the ε-greedy algorithm but uses a different strategy for balancing exploration and exploitation.

In UCB, the algorithm selects the arm with the highest upper confidence bound at each time step.

The upper confidence bound is a measure of how uncertain the algorithm is about the reward distribution for that arm. It is calculated as follows:

UCB = estimated mean reward + exploration bonus

The exploration bonus is a function of the number of times the arm has been selected and the total number of times all arms have been selected.

It is designed to encourage exploration of arms that have been selected less frequently, while still selecting arms with high estimated rewards.

20 of 32

The UCB algorithm starts by selecting each arm once to establish initial estimates of the reward distribution.

It then selects the arm with the highest upper confidence bound at each subsequent time step.

As more data is collected, the estimates of the reward distribution become more accurate and the algorithm becomes more confident in its arm selection.

With UCB, ‘Aₜ’, the action chosen at time step ‘t’, is given by:

where;

Qₜ(a) is the estimated value of action ‘a’ at time step ‘t’.
Nₜ(a) is the number of times that action ‘a’ has been selected, prior to time ‘t’.
‘c’ is a confidence value that controls the level of exploration.

One advantage of UCB over the ε-greedy algorithm is that it does not require a parameter to specify the level of exploration. The algorithm automatically adjusts the level of exploration based on the uncertainty of the reward distribution for each arm.

21 of 32

Thompson Sampling(The Bernoulli bandit)

A variant of the UCB algorithm is the Thompson Sampling algorithm.

This algorithm involves the gambler choosing a slot machine based on a probability distribution that is updated based on the results of previous plays.

The probability distribution is updated using Bayesian inference, which allows the gambler to incorporate new information about the payout probabilities of the slot machines.

In Thompson Sampling, the algorithm starts by assigning prior probability distributions to the reward distributions for each arm.

These prior distributions represent the agent's initial beliefs about the reward distribution for each arm and can be updated as the algorithm gathers more data.

26 of 32

At each time step, the algorithm samples a reward distribution for each arm from its prior probability distribution.

It then selects the arm with the highest sampled reward and receives a reward based on the true reward distribution for that arm.

The algorithm updates its prior probability distribution for the selected arm based on the observed reward, which in turn affects the probability distribution for the next round.

27 of 32

nonassociative tasks, that is, tasks in which there is no need to associate different actions with different situations. In these tasks the learner either tries to find a single best action when the task is stationary, or tries to track the best action as it changes over time when the task is nonstationary. that when a bandit task is selected for you, you are given some distinctive clue about its identity (but not its action values).
Maybe you are facing an actual slot machine that changes the color of its display as it changes its action values. Now you can learn a policy associating each task, signaled by the color you see, with the best action to take when facing that task—for instance, if red, select arm 1; if green, select arm 2. With the right policy you can usually do much better than you could in the absence of any information distinguishing one bandit task from another. This is an example of an associative search task, so called because it involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best.

28 of 32

Associative search tasks are often now called contextual bandits in the literature. Associative search tasks are intermediate between the k-armed bandit problem and the full reinforcement learning problem. They are like the full reinforcement learning problem in that they involve learning a policy, but like our version of the k-armed bandit problem in that each action affects only the immediate reward. If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.

31 of 32

Nonassociative Tasks:
Definition: These are tasks where you don't need to connect different actions with different situations.
Types:

Stationary: The task stays the same, and you need to find the best action to take.
Nonstationary: The task changes over time, and you need to keep track of the best action as it changes.

Example: Imagine you have a slot machine with a display that changes color. Each color signals which slot machine arm is best to pull. For instance:
Red Color: Best to pull arm 1.
Green Color: Best to pull arm 2.
Associative Search Task:
Definition: This is a task where you need to:

Experiment to find out which actions work best.
Learn which action is best for each specific situation (like each color on the slot machine).

By associating the color (situation) with the best action, you can make better decisions and perform better than if you didn’t have any clues about the different tasks.

32 of 32

Contextual Bandits (formerly called Associative Search Tasks):
Definition: These tasks are a middle ground between two types of problems: the k-armed bandit problem and full reinforcement learning.
Key Points:
Like Full Reinforcement Learning: You need to learn a strategy or policy.
Like k-Armed Bandit Problem: Each action only impacts the immediate reward, not the future situation.
Difference from Full Reinforcement Learning:
In full reinforcement learning, actions not only affect the reward you get but also change the situation for the next steps.
In contextual bandits, the action only influences the immediate reward and the current situation remains unchanged for future actions.

1 of 32

2 of 32

3 of 32

4 of 32

5 of 32

6 of 32

7 of 32

8 of 32

9 of 32

10 of 32

11 of 32

12 of 32

13 of 32

14 of 32

15 of 32

16 of 32

17 of 32

18 of 32

19 of 32

20 of 32

21 of 32

22 of 32

23 of 32

24 of 32

25 of 32

26 of 32

27 of 32

28 of 32

29 of 32

30 of 32

31 of 32

32 of 32