The RL Process
The RL Process: a loop of state, action, reward and next state
There is a distinction to make between observation and state, however:
In a chess game, we have access to the whole board, so we receive a state from the environment. In other words, the environment is fully observed.
In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation. In other words, the environment is partially observed.
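The difference between a state and an observation can be sketched in a few lines of Python. This is only an illustration: the one-dimensional level, the tile symbols and the function names `get_state` and `get_observation` are assumptions made up for this example, not part of any real game API.

```python
# Hypothetical 1-D level: '.' empty, '#' block, 'o' player, 'E' exit.
LEVEL = list("..#..o....#...E")


def get_state(level):
    """Fully observed case: the agent receives the whole level."""
    return tuple(level)


def get_observation(level, player_x, radius=2):
    """Partially observed case: only tiles within `radius` of the player."""
    lo = max(0, player_x - radius)
    hi = min(len(level), player_x + radius + 1)
    return tuple(level[lo:hi])


player_x = LEVEL.index("o")
print(get_state(LEVEL))                  # the complete level
print(get_observation(LEVEL, player_x))  # a small window around the player
```

The state contains every tile, while the observation drops everything outside the window, which is the situation described for Super Mario Bros.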
Action Space
The action space is the set of all possible actions in an environment.
The actions can come from a discrete or continuous space:
In Super Mario Bros, the action space is discrete: we have a finite set of only 4 possible actions: left, right, up (jumping) and down (crouching).
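A discrete action space like this one can be written down as a finite collection. The sketch below is a minimal illustration; the names `Action`, `ACTION_SPACE` and `sample_action` are assumptions for this example and do not come from any particular RL library.

```python
import random
from enum import Enum


class Action(Enum):
    LEFT = "left"
    RIGHT = "right"
    UP = "up"      # jumping
    DOWN = "down"  # crouching


# A discrete action space is just a finite set of actions.
ACTION_SPACE = list(Action)


def sample_action(rng):
    """Pick one action uniformly at random from the finite set."""
    return rng.choice(ACTION_SPACE)


rng = random.Random(0)
print(len(ACTION_SPACE))   # number of available actions
print(sample_action(rng))  # one randomly chosen Action member
```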
Rewards and discounting
The reward is fundamental in RL because it’s the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.
The cumulative reward at each time step t equals the sum of all rewards in the sequence:
cumulative reward = r_{t+1} + r_{t+2} + r_{t+3} + \dots
Which is equivalent to:
cumulative reward = \sum_{k=0}^{\infty} r_{t+k+1}
where k = 0 gives r_{t+0+1} = r_{t+1}, k = 1 gives r_{t+1+1} = r_{t+2}, and so on.
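For a finite sequence of rewards, the sum can be computed directly. The following is a small sketch; the reward values are invented for illustration, and the list index k plays the role of k in r_{t+k+1}.

```python
def cumulative_reward(rewards):
    """Sum r_{t+1} + r_{t+2} + ... over a finite list of rewards.

    rewards[k] stands for r_{t+k+1}, so rewards[0] is r_{t+1}.
    """
    return sum(rewards[k] for k in range(len(rewards)))


# Hypothetical rewards received after time step t.
rewards = [1.0, 0.0, 2.0, 5.0]
print(cumulative_reward(rewards))  # 8.0
```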
Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse’s goal is to eat the maximum amount of cheese before being eaten by the cat.
For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you’re killed or you reach the end of the level.
Beginning of a new episode.
Continuing tasks
These are tasks that continue forever (no terminal state). In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.
For instance, consider an agent that does automated stock trading. For this task, there is no starting point or terminal state. The agent keeps running until we decide to stop it.
Think of the policy as the brain of our agent: the function that tells us the action to take given a state.
This policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes expected return when the agent acts according to it. We find this π* through training.
Policy-Based Methods
In Policy-Based methods, we learn a policy function directly.
This function will define a mapping from each state to the best corresponding action. Alternatively, it could define a probability distribution over the set of possible actions at that state.
As we can see here, the policy (deterministic) directly indicates the action to take for each step.
We have two types of policies:
action = policy(state)
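The two forms of policy described above, a direct mapping from state to action and a probability distribution over actions, can be sketched as follows. The state names, action names and probabilities are hypothetical and chosen only for illustration.

```python
import random

# Deterministic policy: each state maps to exactly one action.
# (Hypothetical states and actions.)
deterministic_policy = {"s0": "right", "s1": "up"}


def act_deterministic(state):
    """action = policy(state): always the same action for a given state."""
    return deterministic_policy[state]


# Stochastic policy: each state maps to a distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"up": 0.5, "down": 0.5},
}


def act_stochastic(state, rng):
    """Sample an action according to the distribution for this state."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(act_deterministic("s0"))    # right
print(act_stochastic("s0", rng))  # "left" or "right", usually "right"
```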
“Act according to our policy” just means that our policy is “going to the state with the highest value”.
Here we see that our value function defines a value for each possible state.
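"Going to the state with the highest value" can be illustrated with a tiny lookup. This is a sketch under assumptions: the states, their values and the `neighbors` map are invented here, and a real value function would be learned rather than written by hand.

```python
# Hypothetical value for each state (higher is better).
values = {"A": -7, "B": -6, "C": -5}

# Hypothetical map of which states can be reached from each state.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}


def greedy_next_state(state):
    """Pick the reachable state with the highest value."""
    return max(neighbors[state], key=lambda s: values[s])


print(greedy_next_state("B"))  # C, because -5 > -7
```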
Sequential Decision Problem
Stochastic Actions
[Figure: stochastic action outcomes with probabilities 0.8, 0.1 and 0.1]
Markov Decision Process (MDP)
[Figure: example MDP with states s1, s2, s3, s4, s5; transition probabilities 0.7, 0.3, 0.9, 0.1, 0.3, 0.3, 0.4, 0.99, 0.01, 0.2, 0.8; rewards r=-10, r=20, r=0, r=1, r=0]
Given a set of states in an accessible, stochastic environment, an MDP is defined by:
Transition model: T(s,a,s’) is the probability that state s’ is reached, if action a is executed in state s.
Policy: Complete mapping π that specifies for each state s which action π(s) to take.
Wanted: The optimal policy π* that maximizes the expected utility.
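The transition model and the policy can be represented as plain lookup tables. The sketch below assumes a small made-up MDP with a single action "go"; the states, probabilities and names `T`, `policy`, `is_valid` and `step` are illustrative and are not taken from the figure.

```python
import random

# T[s][a] maps each successor s' to the probability T(s, a, s').
# Hypothetical numbers, for illustration only.
T = {
    "s1": {"go": {"s2": 0.7, "s3": 0.3}},
    "s2": {"go": {"s1": 0.9, "s3": 0.1}},
    "s3": {"go": {"s3": 1.0}},
}

# A policy is a complete mapping: one action pi(s) for every state s.
policy = {"s1": "go", "s2": "go", "s3": "go"}


def is_valid(transitions):
    """Check that the outgoing probabilities of every (s, a) sum to 1."""
    return all(
        abs(sum(dist.values()) - 1.0) < 1e-9
        for actions in transitions.values()
        for dist in actions.values()
    )


def step(state, rng):
    """Execute pi(state) and sample the next state from T."""
    dist = T[state][policy[state]]
    successors = list(dist)
    weights = [dist[s] for s in successors]
    return rng.choices(successors, weights=weights, k=1)[0]


rng = random.Random(0)
print(is_valid(T))      # True
print(step("s1", rng))  # "s2" or "s3"
```

Because the outcome of an action is sampled, running the same policy from the same state can produce different trajectories, which is why the optimal policy is defined through expected utility.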
Optimal Policies (1)
Optimal policy for our MDP:
Optimal Policies (2)
[Figure: optimal policies for four ranges of the reward R(s): R(s) ≤ -1.6248; -0.4278 < R(s) < -0.085; -0.0221 < R(s) < 0; 0 < R(s)]
How to compute optimal policies?
Horizon and Rewards
Over an infinite horizon, the total reward R(s0)+R(s1)+R(s2)+… could be unbounded.
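One standard way to keep this sum finite is to discount future rewards. The derivation below is a sketch under two assumptions that are not stated in the text above: a discount factor \gamma with 0 \le \gamma < 1, and a bound R_{\max} on the magnitude of every reward.

```latex
% Assumptions: 0 <= \gamma < 1 and |R(s)| <= R_{\max} for all states s.
\left| \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \right|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
```

The last step is the geometric series, so under these assumptions the discounted sum stays bounded even when the agent runs forever.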