Deep Reinforcement Learning
B. Ravindran
Reconfigurable and Intelligent Systems Engineering (RISE) Group
Department of Computer Science and Engineering
Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI)
Indian Institute of Technology Madras
Need for Generalization
Need for Deep RL
Value Function Approximation
Linear Q-learning
TD Error
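The equations on these slides were not captured in the extracted text. For reference, a standard form of linear Q-learning and its TD error, assumed here to match what the slide shows, is:

\[
\hat{q}(s,a,\mathbf{w}) = \mathbf{w}^{\top}\phi(s,a), \qquad
\delta_t = r_{t+1} + \gamma \max_{a'} \hat{q}(s_{t+1},a',\mathbf{w}) - \hat{q}(s_t,a_t,\mathbf{w}), \qquad
\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\delta_t\,\phi(s_t,a_t)
\]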
We can exploit the fact that the Q-values of nearby states usually do not differ much.
Assume the task is to learn a policy on a large gridworld.
Naive method:-
Divide the grid into smaller 10x10 blocks and treat each block as one aggregated state.
This causes abrupt changes in the Q-values of states that lie on block boundaries.
Coarse coding:-
Uses overlapping receptive fields, which avoids abrupt changes in state values and smoothens the Q-values as the agent moves from one cluster to another.
Issue:-
The number of 'ON' bits used to represent a state is not uniform across states.
Tile coding:-
A systematic form of coarse coding.
The number of 'ON' bits equals the number of tilings used, so every state activates the same number of features (see the sketch below).
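A minimal sketch of tile coding for a two-dimensional state (e.g. a gridworld position scaled to [0, 1]); the grid size, number of tilings and offsets below are illustrative assumptions, not values from the slides.

def tile_features(x, y, n_tilings=4, tiles_per_dim=10, low=0.0, high=1.0):
    """Return the indices of the 'ON' tiles: exactly one per tiling."""
    tile_width = (high - low) / tiles_per_dim
    active = []
    for t in range(n_tilings):
        # Each tiling is shifted by a fraction of the tile width.
        offset = t * tile_width / n_tilings
        ix = min(int((x - low + offset) / tile_width), tiles_per_dim)
        iy = min(int((y - low + offset) / tile_width), tiles_per_dim)
        tiles_per_tiling = (tiles_per_dim + 1) ** 2
        active.append(t * tiles_per_tiling + ix * (tiles_per_dim + 1) + iy)
    return active  # the number of 'ON' bits equals n_tilings for every state

# A linear approximator then sums the weights of the active tiles:
# q_sa = sum(w[a][i] for i in tile_features(x, y))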
Tile and Coarse coding
CMAC (Cerebellar Model Articulation Controller):-
1. Proposed by James Albus in 1975.
2. A form of coarse coding.
3. Each feature has a value of 1 inside k square regions and 0 elsewhere.
4. A hash function is used to ensure that the k squares are pseudo-randomly scattered over the input space.
5. Implemented so that exactly c features are active for any given input.
6. The hash function makes generalization even better.
7. Typically used when the input is high-dimensional.
Radial Basis Functions:-
1. The output of a radial basis function depends on the distance between the input and some fixed centre c, e.g. a Gaussian exp(-||s - c||^2 / 2σ^2).
2. A weighted sum of many radial basis functions can be used to approximate Q(s,a), as in the sketch below.
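A minimal sketch of a Gaussian-RBF approximator for Q(s,a); the centres, width and weight layout are illustrative assumptions.

import numpy as np

class RBFQApproximator:
    def __init__(self, centres, sigma, n_actions):
        self.centres = np.asarray(centres, dtype=float)     # shape: (n_features, state_dim)
        self.sigma = sigma
        self.w = np.zeros((n_actions, len(self.centres)))   # one weight vector per action

    def features(self, s):
        # Gaussian RBF activations: the closer the state is to a centre, the larger the activation.
        d2 = np.sum((self.centres - np.asarray(s, dtype=float)) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))

    def q(self, s, a):
        # Q(s, a) is a weighted sum of the radial basis functions.
        return float(self.w[a] @ self.features(s))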
Additional Linear Approximators
Non-Linear Function Approximator
1. Linear function approximators are restrictive: they can only model functions that are linear in the features, although basis expansion does allow functions that are non-linear in the original input space.
2. Non-linear approximators can model complex functions and are very powerful.
3. The features are learnt on the fly rather than hand-coded, as is the case with tile and coarse coding.
4. Can generalize to unseen states.
Disadvantage:-
Requires a lot of data and compute.
Human-Level Backgammon Player: TD-Gammon (Tesauro 92, 94, 95)
Beat the best human player in 1995.
Learnt completely by self-play.
New moves not recorded by humans in centuries of play.
What about the features?
Deep Q-Learning
[Figure: the DQN architecture from the Nature 2015 paper. The convolutional layers perform feature learning; the final fully connected layer is effectively a linear function approximator over the learnt features.]
Source: Deep Q Networks, Nature 2015
Q-Network Learning
Divergence is an issue since the current network is used to decide its own target: the bootstrapped target moves with every update.
To remove correlations, we build a dataset of transitions from the agent's experience.
Experiences are sampled from this dataset; the target network weights w^- are kept frozen (with periodic updates) to address non-stationarity.
Replay Memory
Freeze Target Network
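A minimal sketch of how these two ideas fit together in the DQN update (PyTorch is used here; the buffer layout, network API and hyper-parameters are illustrative assumptions rather than the paper's exact settings).

import random
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, replay, batch_size=32, gamma=0.99):
    # Sample uncorrelated transitions from the replay memory.
    states, actions, rewards, next_states, dones = zip(*random.sample(replay, batch_size))
    states      = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    next_states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in next_states])
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones   = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s, a; w) for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target uses the frozen network w^-: r + gamma * max_a' Q(s', a'; w^-).
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values

    loss = nn.SmoothL1Loss()(q_sa, target)   # Huber loss, as in the Nature DQN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically refresh the frozen targets:
# target_net.load_state_dict(q_net.state_dict())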
Architecture:-
1. A set of convolutional layers acts as the feature extractor.
2. These features are then passed through a series of fully connected layers.
3. The output layer has |A| nodes, one per action, each giving the Q-value of that action.
The network is trained using the Huber loss rather than the regular least-squares loss, which makes the updates robust to large TD errors.
A standard form of the Huber loss is given below.
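For reference, a standard form of the Huber loss on the TD error e with threshold δ (δ = 1 corresponds to the error clipping used in the Nature DQN paper; the slide's exact notation may differ):

\[
L_{\delta}(e) =
\begin{cases}
\tfrac{1}{2}e^{2} & \text{if } |e| \le \delta,\\
\delta\left(|e| - \tfrac{1}{2}\delta\right) & \text{otherwise.}
\end{cases}
\]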
Deep Q-Networks
Maximization Bias
A Q-learning agent is trained on a simple two-state MDP with states A and B.
Taking right in state A always gives a reward of 0 and moves the agent to a terminal state.
Taking left in state A leads the agent to state B. In state B the agent has N actions, but every action gives a reward sampled from N(-0.1, 1) and leads to a terminal state.
Even in this simple environment, the agent needs many samples to figure out the right policy.
The Q-learning agent tends to overestimate the Q-values.
Due to the high variance of the rewards from B, some actions will have positive sample estimates even though the true expected reward of every action is -0.1. Hence, when max_a Q(B,a) is taken, there is a high chance that some action appears to have a positive value.
So during the initial stages of training the agent goes left, believing that state B will yield a positive reward and that taking left in state A is therefore beneficial.
As learning progresses the Q-values stabilise and the left action is no longer preferred in state A.
As mentioned earlier, this behaviour is mainly attributed to the max_a' Q(s',a') term in the target, where s' is the state reached from (s,a). A small numerical illustration follows.
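A quick numerical illustration of this bias (a sketch with illustrative numbers, not part of the slides): with 10 actions and a handful of samples each, the maximum of the sample means is usually positive even though every true mean is -0.1.

import numpy as np

rng = np.random.default_rng(0)
n_actions, n_samples, true_mean = 10, 5, -0.1

# Sample-mean estimate of Q(B, a) for each action, from N(-0.1, 1) rewards.
estimates = rng.normal(true_mean, 1.0, size=(n_actions, n_samples)).mean(axis=1)

print(estimates.max())   # typically well above 0: the max over noisy estimates is biased upwards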
Double Q Learning
Double Q-Learning was introduced to handle maximization bias.
Problem:-
The same samples are used both to decide which action is best and to estimate the value of that action.
Solution:-
1. Two separate estimators of the Q-value, Q_A and Q_B, are maintained.
2. At every step, one of Q_A or Q_B is chosen (e.g. at random) and updated.
3. To update Q_A, the maximizing action is selected with Q_A but its value is taken from Q_B, and vice versa (see the sketch below).
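A minimal tabular sketch of the double Q-learning update (the dict-based tables, learning rate and random coin flip are illustrative assumptions).

import random
from collections import defaultdict

Q_A = defaultdict(float)
Q_B = defaultdict(float)

def double_q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99, terminal=False):
    if random.random() < 0.5:
        # Update Q_A: pick the greedy action with Q_A, evaluate it with Q_B.
        a_star = max(actions, key=lambda ap: Q_A[(s_next, ap)])
        target = r if terminal else r + gamma * Q_B[(s_next, a_star)]
        Q_A[(s, a)] += alpha * (target - Q_A[(s, a)])
    else:
        # Symmetric update for Q_B.
        a_star = max(actions, key=lambda ap: Q_B[(s_next, ap)])
        target = r if terminal else r + gamma * Q_A[(s_next, a_star)]
        Q_B[(s, a)] += alpha * (target - Q_B[(s, a)])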
Deep Double Q-Learning
Instead of two independent estimators, two versions of a single Q estimator are used.
The online network Q is used to select the greedy action, while the target network Q' is used to evaluate it.
Updating Q':-
1. Hard update: periodically copy the parameters of Q into Q'.
2. Soft update: Polyak averaging, Q' ← τQ + (1 − τ)Q' with a small τ (see the sketch below).
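A sketch of the two update schemes in PyTorch-style code (the value of tau is an illustrative assumption; values around 0.005 are common).

import torch

def hard_update(q_net, target_net):
    # Copy the online parameters into the target network.
    target_net.load_state_dict(q_net.state_dict())

def polyak_update(q_net, target_net, tau=0.005):
    # Slowly blend the online parameters into the target network.
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)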
Dueling DQN
Motivation:-
1. In many states it is not necessary to estimate the value of every action choice.
E.g.:- the Enduro game in Atari.
The objective is to pass cars without collision.
Dueling Architecture:-
1. The value stream attends to both the horizon of the track and the score.
2. The advantage stream pays attention only when there are cars immediately ahead; only in those states does it matter whether the agent moves left or right.
Estimates the Q-value using two streams: the advantage A(s,a) and the state value V(s).
Advantage:- measures how much better an action is than the average action in that state, A(s,a) = Q(s,a) − V(s).
By explicitly separating the two estimators, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action in each state.
Aggregation layer:-
1. A separate aggregation layer is required to combine V(s) and A(s,a).
2. Simple addition does not work; it runs into the issue of identifiability.
Issue of unidentifiability:-
1. Given only Q(s,a), it is impossible to recover V(s) and A(s,a) uniquely.
Overcome:-
1. The average advantage over actions is subtracted, i.e. Q(s,a) = V(s) + A(s,a) − (1/|A|) Σ_a' A(s,a') (see the sketch below).
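A minimal sketch of the dueling head with the averaging aggregation (the layer sizes are illustrative assumptions).

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim, n_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)         # V(s), shape (batch, 1)
        a = self.advantage(features)     # A(s, a), shape (batch, |A|)
        # Subtract the mean advantage to resolve the identifiability issue.
        return v + a - a.mean(dim=1, keepdim=True)   # Q(s, a)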
Policy Gradient Theorem
[The slide states the policy gradient theorem; the equations themselves appeared as figures.]
Q^π(s,a) can be approximated using the return, as we did in Monte Carlo policy gradient.
Alternatively, we can use the Q-value itself, learnt via a TD procedure (e.g. SARSA)!
But the value parameterization should then be compatible with the policy parameterization.
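For reference, a standard statement of the policy gradient theorem and the compatibility condition referred to above (the slide's own notation may differ):

\[
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[\nabla_{\theta}\log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s,a)\right],
\qquad
\nabla_{w} Q_{w}(s,a) = \nabla_{\theta}\log \pi_{\theta}(a \mid s)
\]

The second equation is the compatible function approximation condition; together with fitting Q_w by minimizing the mean-squared error, it guarantees that the gradient computed with Q_w equals the true policy gradient.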
Actor-Critic
Policy gradient algorithms try to increase the probability of trajectories that are rewarding. A simple update rule that achieves this scales the gradient of the log-probability of each action by the return that follows it (the REINFORCE update).
Issues with this rule:-
1. The updates have high variance.
2. When the cumulative reward is 0, there is no signal to tell which actions were good and which were bad.
Solution:-
1. A baseline is introduced, which reduces the variance of the update without biasing the gradient.
Different choices of baseline can be considered, leading to different versions of the actor-critic algorithm (see below).
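In standard notation (not verbatim from the slides), with a state-dependent baseline b(s) the update becomes:

\[
\theta \leftarrow \theta + \alpha\, \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)\,\big(G_t - b(s_t)\big)
\]

Choosing b(s) = V^π(s) and estimating the return with bootstrapping gives the advantage actor-critic family discussed next.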
Advantage Actor-Critic (A2C)
The most popular variant of the actor-critic algorithm.
The advantage function is used to decide whether an action was good or bad.
The advantage function tells us how much better a given action is than the action taken on average in that state: A(s,a) = Q(s,a) − V(s).
We do not need two separate neural networks to obtain A(s,a): the TD error r + γV(s') − V(s) is itself an estimate of the advantage, so a single value network suffices (see the sketch below).
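A minimal sketch of estimating the advantage from a single value network via the TD error (the network API and discount value are illustrative assumptions).

import torch

def advantage_estimate(value_net, s, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        v_s      = value_net(s)
        v_s_next = torch.zeros_like(v_s) if done else value_net(s_next)
    # The TD error  r + gamma * V(s') - V(s)  is an estimate of A(s, a)
    # for the action actually taken in s.
    return r + gamma * v_s_next - v_s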
Asynchronous Advantage Actor-Critic (A3C)
Makes asynchronous parameter updates.
A parameter server maintains a global copy of the network parameters.
Multiple workers are created, each interacting with its own copy of the environment.
Each worker computes gradients locally and sends them to the parameter server, which applies them to the global network.
The updated global network weights are then sent back to the workers.
This greatly reduces the real (wall-clock) time needed to train the agent.
Are asynchronous updates really needed? No; a synchronous variant (A2C) that batches the workers' experience works just as well.
(Another look at) Advantage Actor Critic
Objective functions: the critic is trained on an n-step truncated, corrected return (the sum of rewards along the trajectory, corrected with a bootstrapped value at the end); empirically, n = 20 works best.
Asynchronous Advantage Actor Critic
The same n-step truncated, corrected return (n = 20) is used, and the advantage function serves as the critic's signal.
Compare to REINFORCE
In Williams' REINFORCE terminology, the subtracted term is the reinforcement baseline and the ∇_θ log π_θ(a|s) term is the characteristic eligibility.
Deterministic Policy Gradient
[These slides present the deterministic policy gradient theorem through equations that appeared as figures.]
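For reference, the deterministic policy gradient theorem that these slides build on can be written as follows (an assumption that this matches the slides' notation, for a deterministic policy μ_θ and state distribution ρ):

\[
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho}\!\left[\nabla_{\theta}\mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)}\right]
\]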
Deep Deterministic Policy Gradient (DDPG)
Deep Deterministic Policy Gradient: Algorithm
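A minimal sketch of one DDPG update step (PyTorch; the replay-buffer batch layout, the critic signature Q(s, a), and the hyper-parameters are illustrative assumptions, not the slides' exact algorithm).

import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch   # tensors sampled from a replay buffer

    # Critic: regress Q(s, a) towards r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        target = r + gamma * (1 - done) * critic_targ(s2, actor_targ(s2)).squeeze(-1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(-1), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the deterministic policy gradient, i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak-average the target networks.
    with torch.no_grad():
        for net, net_targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), net_targ.parameters()):
                p_targ.mul_(1 - tau).add_(tau * p)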
Summing up