4 of 38

The reward hypothesis: the central idea of Reinforcement Learning
⇒ Why is the goal of the agent to maximize the expected return?
Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).
That’s why in Reinforcement Learning, to have the best behavior, we aim to learn to take actions that maximize the expected cumulative reward.
Markov Property

5 of 38

Markov Property
The Markov Property implies that our agent needs only the current state to decide what action to take and not the history of all the states and actions they took before.
Observations/States Space
Observations/States are the information our agent gets from the environment. In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc.

6 of 38

There is a differentiation to make between observation and state, however:

State s: is a complete description of the state of the world (there is no hidden information). In a fully observed environment.

In chess game, we receive a state from the environment since we have access to the whole check board information.

In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.

7 of 38

Observation o: is a partial description of the state. In a partially observed environment.

In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.

In Super Mario Bros, we are in a partially observed environment. We receive an observation since we only see a part of the level.

9 of 38

Action Space

The Action space is the set of all possible actions in an environment.

The actions can come from a discrete or continuous space:

Discrete space: the number of possible actions is finite.

In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).

Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.

10 of 38

Continuous space: the number of possible actions is infinite

A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21,1°, 21,2°, honk, turn right 20°

11 of 38

Rewards and the discounting

The reward is fundamental in RL because it’s the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.

The cumulative reward at each time step t can be written as:

The cumulative reward equals the sum of all rewards in the sequence.

Which is equivalent to:

The cumulative reward = rt+1 (rt+k+1 = rt+0+1 = rt+1)+ rt+2 (rt+k+1 = rt+1+1 = rt+2) +

12 of 38

Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse’s goal is to eat the maximum amount of cheese before being eaten by the cat.

�

13 of 38

As we can see in the diagram, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).
Consequently, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we’re not really sure we’ll be able to eat it.
To discount the rewards, we proceed like this:
We define a discount rate called gamma. It must be between 0 and 1. Most of the time between 0.95 and 0.99.
The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).

14 of 38

Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.
Our discounted expected cumulative reward is:
�

� <

15 of 38

Type of tasks
A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.
Episodic task
In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States.
�

16 of 38

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario Level and ends when you’re killed or you reached the end of the level.

Beginning of a new episode.

17 of 38

Continuing tasks

These are tasks that continue forever (no terminal state). In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.

For instance, an agent that does automated stock trading. For this task, there is no starting point and terminal state. The agent keeps running until we decide to stop it.

�

18 of 38

Exploration is exploring the environment by trying random actions in order to find more information about the environment.
Exploitation is exploiting known information to maximize the reward.
Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.

19 of 38

Exploitation: You go to the same one that you know is good every day and take the risk to miss another better restaurant.
Exploration: Try restaurants you never went to before, with the risk of having a bad experience but the probable opportunity of a fantastic experience.

20 of 38

Two main approaches for solving RL problems
In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?
The Policy π: the agent’s brain
The Policy π is the brain of our Agent, it’s the function that tells us what action to take given the state we are in. So it defines the agent’s behavior at a given time.

21 of 38

Think of policy as the brain of our agent, the function that will tell us the action to take given a state

This Policy is the function we want to learn, our goal is to find the optimal policy π*, the policy that maximizes expected return when the agent acts according to it. We find this π* through training.

22 of 38

There are two approaches to train our agent to find this optimal policy π*:
Directly, by teaching the agent to learn which action to take, given the current state: Policy-Based Methods.
Indirectly, teach the agent to learn which state is more valuable and then take the action that leads to the more valuable states: Value-Based Methods.

23 of 38

Policy-Based Methods

In Policy-Based methods, we learn a policy function directly.

This function will define a mapping from each state to the best corresponding action. Alternatively, it could define a probability distribution over the set of possible actions at that state.

As we can see here, the policy (deterministic) directly indicates the action to take for each step.

24 of 38

We have two types of policies:

Deterministic: a policy at a given state will always return the same action.

action = policy(state) �

26 of 38

Value-based methods
In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.
The value of a state is the expected discounted return the agent can get if it starts in that state, and then acts according to our policy.
“Act according to our policy” just means that our policy is “going to the state with the highest value”.

27 of 38

“Act according to our policy” just means that our policy is “going to the state with the highest value”.

Here we see that our value function defined values for each possible state.�

28 of 38

Here we see that our value function defined values for each possible state.

�

at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.

29 of 38

Deterministic, Fully Observable
Definition: In a deterministic, fully observable environment, the outcomes of actions are completely predictable, and the entire state of the environment is observable at any time.
Characteristics:
Deterministic: Given a state and an action, the next state is always the same.
Fully Observable: The agent can see the entire state of the environment at any time.
Example:
Grid World (Simple): Imagine a grid where an agent moves in one of four directions (up, down, left, right). Each move deterministically leads to a new cell in the grid, and the entire grid is visible to the agent at all times.

30 of 38

Stochastic, Fully Observable
Definition: In a stochastic, fully observable environment, the outcomes of actions are probabilistic, but the agent has complete visibility of the state of the environment.
Characteristics:
Stochastic: Given a state and an action, the next state is determined probabilistically.
Fully Observable: The agent can see the entire state of the environment at any time.
Example:
Board Games (e.g., Chess or Go): In these games, the outcome of moves is probabilistic (in games with elements of chance), but the entire board is visible to the player.
Key Aspects:
State Transition: The transition probabilities must be taken into account.
Decision Making: Requires dealing with uncertainty due to the stochastic nature of state transitions.

31 of 38

Stochastic, Partially Observable
Definition: In a stochastic, partially observable environment, the outcomes of actions are probabilistic, and the agent does not have complete visibility of the state of the environment. The agent only has access to partial observations.
Characteristics:
Stochastic: The environment's response to actions is probabilistic.
Partially Observable: The agent can only observe part of the environment or receive noisy observations.
Example:
Hidden Markov Models (HMMs): In HMMs, the state of the system is hidden, and observations are probabilistic and noisy. For instance, in a robotic navigation task where the robot cannot fully perceive its surroundings due to sensor limitations.
Key Aspects:
State Transition: Both the transition probabilities and partial observability must be considered.
Decision Making: Requires techniques like belief states, filtering, and estimation (e.g., using Partially Observable Markov Decision Processes or POMDPs).

32 of 38

Sequential Decision Problem

Beginning in the start state, agent must choose an action at each time step.
Interaction with environment terminates if the agent reaches one of the goal states (4, 3) (reward of +1) or (4,1) (reward –1). Each other location has a reward of -.04.
In each location the available actions are Up, Down, Left, Right.

33 of 38

Stochastic Actions

Each action achieves the intended effect with probability 0.8, but the rest of the time, the agent moves at right angles to the intended direction.

0.8

0.1

34 of 38

Markov Decision Process (MDP)

s₂

s₃

s₄

s₅

s₁

0.7

0.3

0.9

0.1

0.3

0.4

0.99

0.01

0.2

0.8

r=-10

r=20

r=0

r=1

r=0

35 of 38

Markov Decision Process (MDP)

Given a set of states in an accessible, stochastic environment, an MDP is defined by

Initial state S₀
Transition Model T(s,a,s’)
Reward function R(s)

Transition model: T(s,a,s’) is the probability that state s’ is reached, if action a is executed in state s.

Policy: Complete mapping π that specifies for each state s which action π(s) to take.

Wanted: The optimal policy π* that maximizes the expected utility.

36 of 38

Optimal Policies (1)

Given the optimal policy, the agent uses its current percept that tells it its current state.
It then executes the action π*(s).
We obtain a simple reflex agent that is computed from the information used for a utility-based agent.

Optimal policy for our MDP:

37 of 38

Optimal Policies (2)

R(s) ≤ -1.6248

-0.0221 < R(s) < 0

-0.4278 < R(s) < -0.085

0 < R(s)

How to compute optimal policies?

38 of 38

Horizon and Rewards

Finite : Plan t steps into future. �Reward = R(s⁰)+R(s¹)+R(s²)+…+R(s^t)�Optimal action changes with time!
Infinite : The agent never dies.

The reward R(s⁰)+R(s¹)+R(s²)+… could be unbounded.

Discounted reward : R(s⁰)+γR(s¹)+ γ²R(s²)+…
Average reward : lim _n🡪∞ (1/n)[Σ_i R(sⁱ)]

1 of 38

2 of 38

3 of 38