1 of 40

Deep Reinforcement Learning

B. Ravindran

Reconfigurable and Intelligent Systems Engineering (RISE) Group

Department of Computer Science and Engineering

Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI)

Indian Institute of Technology Madras

2 of 40

Need for Generalization

  • Issues with large state/action spaces:
    • tabular approaches not memory efficient
    • data sparsity
    • continuous state/action spaces
  • We need methods that can generalize experience gained over a limited subset of the state space
  • Use a parameterized representation


3 of 40

Need for Deep RL

  • Issues with large state/action spaces:
    • tabular approaches not memory efficient
    • data sparsity
    • continuous state/action spaces
  • Use a parameterized representation
    • Value Functions
    • Policies
    • Models


4 of 40

Value Function Approximation

  • Least squares

  • But we don’t know the target!
  • Use the TD target.
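A minimal sketch of the objective being described, assuming an action-value approximator $\hat{q}(s,a,\mathbf{w})$ (standard notation, not necessarily the slide's exact symbols):

$$ J(\mathbf{w}) = \mathbb{E}_\pi\big[\big(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\big)^2\big] $$

Since $q_\pi$ is unknown, the TD target is substituted, giving the semi-gradient update

$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha\big[R + \gamma\,\hat{q}(S',A',\mathbf{w}) - \hat{q}(S,A,\mathbf{w})\big]\,\nabla_{\mathbf{w}}\hat{q}(S,A,\mathbf{w}) $$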


5 of 40

Linear Q-learning

  • With a fixed policy, linear TD is known to converge to a point close to the RMSE minimizer
  • No such convergence results for Q-learning
  • No strong results for other, more complex parameterizations
    • But many successful examples: TD-Gammon, Atari, …
  • Deep RL!!
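For concreteness, with a linear approximator $\hat{q}(s,a,\mathbf{w}) = \mathbf{w}^\top \phi(s,a)$, the Q-learning update takes the standard semi-gradient form (the bracketed term is the TD error):

$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha\big[r + \gamma \max_{a'} \hat{q}(s',a',\mathbf{w}) - \hat{q}(s,a,\mathbf{w})\big]\,\phi(s,a) $$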



6 of 40

Tile and Coarse Coding

We can exploit the fact that the Q-values of nearby states usually do not change much.

Assume the task is to learn a policy on a large gridworld.

Naive method:

Divide the grid into smaller 10x10 grids. This causes abrupt changes in the Q-values of states that lie on the boundaries.

Coarse coding:

Avoids abrupt changes in the value of a state; it smooths the Q-values during the transition from one cluster to another.

Issue: no uniformity in the number of ‘ON’ bits used to represent a state.

Tile coding:

A form of coarse coding, but systematic. The number of ‘ON’ bits equals the number of tilings used (a small sketch follows below).
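A minimal sketch of tile coding on a 2-D state space (illustrative Python; the tile width and number of tilings are arbitrary choices, not values from the lecture):

```python
def tile_code(x, y, n_tilings=4, tile_width=1.0, tiles_per_dim=10):
    """Return the indices of the 'ON' tiles for a 2-D point (x, y).

    Each of the n_tilings grids is offset by a fraction of the tile width,
    so exactly n_tilings features are active for any input (one per tiling).
    """
    active = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings          # shift each tiling slightly
        col = int((x + offset) // tile_width) % tiles_per_dim
        row = int((y + offset) // tile_width) % tiles_per_dim
        # Flatten (tiling, row, col) into a single feature index.
        active.append(t * tiles_per_dim**2 + row * tiles_per_dim + col)
    return active

# Nearby states share most of their active tiles, so their values generalize:
print(tile_code(3.20, 5.10))   # -> [53, 153, 253, 353]
print(tile_code(3.25, 5.15))   # -> [53, 153, 253, 354]  (shares 3 of 4 tiles)
```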

7 of 40

Additional Linear Approximators

CMAC (Cerebellar Model Articulation Controller):

1. Proposed by James Albus in 1975.
2. A form of coarse coding.
3. Each feature has a value of 1 inside k square regions and 0 elsewhere.
4. A hash function is used to make sure that the k squares are randomly scattered.
5. Implemented in such a way that exactly c different features are active for any given input.
6. The hashing makes generalization even better.
7. Typically used where the input is high dimensional.

Radial Basis Functions:

1. The output of a radial basis function depends on the distance between the input and some fixed centre c.
2. A sum of many radial basis functions can be used to approximate Q(s,a), as sketched below.
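For reference, a Gaussian radial basis feature and the resulting approximation (standard form; $c_i$ and $\sigma_i$ denote the centre and width of the $i$-th feature):

$$ \phi_i(s) = \exp\!\Big(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\Big), \qquad \hat{q}(s,a,\mathbf{w}) \approx \sum_i w_{i,a}\,\phi_i(s) $$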

8 of 40

Non-Linear Function Approximators

1. Linear function approximators are restrictive: they can only model functions that are linear in the features, although basis expansion does help to produce functions that are non-linear in the original input space.

2. Non-linear approximators can model complex functions and are very powerful.

3. The features are learnt on the fly rather than hand-coded, as is the case with tile and coarse coding.

4. They can generalize to unseen states.

Disadvantage: requires a lot of data and compute.

9 of 40

Human-Level Backgammon Player: TD-Gammon (Tesauro 92, 94, 95)

  • Beat the best human player in 1995
  • Learnt completely by self-play
  • New moves not recorded by humans in centuries of play


10 of 40

What about the features?


  • Learnt to play from video input from scratch!

11 of 40

Deep Q-Learning


Source: Deep Q Networks, Nature 2015

12 of 40

Deep Q-Learning

  • Input: 84 × 84 × 4
  • Layer 1: conv 8×8, stride 4, 16 filters → ReLU
  • Layer 2: conv 4×4, stride 2, 32 filters → ReLU
  • Layer 3: fully connected, 256 units → ReLU
  • Output layer: fully connected, 18 units → linear
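A minimal sketch of this network in PyTorch (the framework is an assumption; layer sizes follow the bullets above):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network with the layer sizes listed above: 84x84x4 input,
    two conv layers (16 and 32 filters), a 256-unit hidden layer,
    and one linear Q-value output per action."""

    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 84 -> 20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 20 -> 9
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),  # linear output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# q_values = QNetwork()(torch.zeros(1, 4, 84, 84))  # shape (1, 18)
```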


Source: Deep Q Networks, Nature 2015

13 of 40

Deep Q-Learning


Source: Deep Q Networks, Nature 2015

14 of 40

Deep Q-Learning


Source: Deep Q Networks, Nature 2015

Feature Learning

15 of 40

Deep Q-Learning


Source: Deep Q Networks, Nature 2015

Feature Learning

Linear FA!

16 of 40

Q-Network Learning

Divergence is an issue, since the current network is used to decide its own target.

  • Correlations between samples
  • Non-stationarity of the targets

  • How do we address these issues?
    • Replay memory
    • Freeze the target network


17 of 40

Q-Network Learning


Replay memory: to remove correlations, we build a dataset from the agent’s experience and sample experiences (minibatches) from it.

Freeze target network: targets are computed with frozen parameters w⁻ (updated periodically) to address the non-stationarity of the targets.
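Putting the two ideas together, the loss minimized at each step has the standard DQN form (replay memory $\mathcal{D}$, online parameters $\mathbf{w}$, frozen target parameters $\mathbf{w}^-$):

$$ L(\mathbf{w}) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s',a';\mathbf{w}^-) - Q(s,a;\mathbf{w})\big)^2\Big] $$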

18 of 40

Deep Q-Networks

Architecture:

1. A set of convolutional layers acts as feature extractors.
2. These features are then passed through a series of fully connected layers.
3. The output layer has |A| nodes, which give the Q-value for each action.

The network is updated using the Huber loss rather than the regular least-squares loss. Below is the expression for the Huber loss.
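With TD error $\delta$ and threshold $\kappa$ (commonly $\kappa = 1$), the standard Huber loss is

$$ L_\kappa(\delta) = \begin{cases} \tfrac{1}{2}\delta^2 & \text{if } |\delta| \le \kappa \\ \kappa\big(|\delta| - \tfrac{1}{2}\kappa\big) & \text{otherwise} \end{cases} $$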

19 of 40

Maximization Bias

A Q-learning agent is trained on a simple two-state MDP with states A and B, defined as follows.

Taking right in state A always gives a reward of 0 and puts the agent in a terminal state.

Taking left in state A leads the agent to state B. The agent has N actions available in state B, but every action gets a reward sampled from N(-0.1, 1) and leads to a terminal state.

20 of 40

Maximization Bias

In this simple environment, the agent needs many samples to figure out the right policy.

A Q-learning agent tends to overestimate the Q-values.

Due to the high variance of the rewards received from state B, some actions will have positive sample averages even though the true expected reward of every action is -0.1. Hence, when max_a Q(B,a) is taken, there is a good chance that some action or the other appears to have a positive value.

So during the initial stages of training the agent goes left, thinking that it will receive a positive reward from state B and that taking left in state A is beneficial.

As learning progresses the Q-values stabilise and the left action is no longer preferred in state A.

As mentioned earlier, this behaviour is mainly attributed to the max_{a'} Q(s',a') term, where s' is the state reached from (s,a).
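The root cause can be stated compactly: since the max of noisy estimates is biased upwards (Jensen's inequality, $\max$ being convex),

$$ \mathbb{E}\Big[\max_{a} \hat{Q}(B,a)\Big] \;\ge\; \max_{a} \mathbb{E}\big[\hat{Q}(B,a)\big] = -0.1, $$

so the bootstrapped value of going left in A is overestimated even though every action in B is bad in expectation.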

21 of 40

Double Q Learning

Double Q-Learning was introduced to handle maximization bias.

Problem:

The same samples are used both to decide which action is the best and to estimate its value.

Solution:

1. Two different estimators of the Q-value, Q_A and Q_B, are used.
2. At every step either Q_A or Q_B is updated.
3. When Q_A is updated, Q_A is used to determine the maximizing action and Q_B is used to evaluate it (and vice-versa), as in the update sketched below.
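For reference, the tabular double Q-learning update when $Q_A$ is selected for updating (the $Q_B$ case is symmetric):

$$ Q_A(s,a) \leftarrow Q_A(s,a) + \alpha\Big[r + \gamma\, Q_B\big(s', \arg\max_{a'} Q_A(s',a')\big) - Q_A(s,a)\Big] $$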

22 of 40

Deep Double Q-Learning

Here, two different versions of a single Q-value estimator are used.

The online network Q is used for action selection and the target network Q' is used for action evaluation (see the target below).

Updating Q':

1. Hard-copy the parameters of Q to Q' periodically.
2. Polyak averaging.
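With online parameters $\mathbf{w}$ and target parameters $\mathbf{w}^-$, the Double DQN target takes the standard form

$$ y = r + \gamma\, Q\big(s',\, \arg\max_{a'} Q(s',a';\mathbf{w});\, \mathbf{w}^-\big). $$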

23 of 40

Dueling DQN

Motivation:

1. In many states it is not necessary to estimate the value of each action choice.

E.g., the Enduro game in Atari: the objective is to pass the cars without collision.

Dueling architecture:

1. The value stream attends to both the horizon of the track and the score.
2. The advantage stream concentrates only when there are cars immediately ahead; only in these states is there a need to estimate the value of moving left or right.

The Q-value is estimated using both the advantage A(s,a) and the value V(s).

Advantage: measures the benefit of taking a particular action over the others in that state.

By explicitly separating the two estimators, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action in each state.

24 of 40

Dueling DQN

Aggregation layer:

1. A separate aggregation layer is required to combine V(s) and A(s,a).
2. Simple addition would not work; it runs into the issue of identifiability.

Issue of unidentifiability:

1. It is impossible to uniquely recover V(s) and A(s,a) from Q(s,a) alone: adding a constant to V(s) and subtracting it from A(s,a) leaves Q(s,a) unchanged.

To overcome this:

1. The average advantage is subtracted, as shown below.
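The aggregation used in the dueling architecture subtracts the mean advantage so that V and A become identifiable (form from the Dueling DQN paper):

$$ Q(s,a) = V(s) + \Big(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\Big) $$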


26 of 40

Policy Gradient Theorem

Let J(θ) denote the performance of the policy π_θ; the policy gradient theorem (a standard statement is sketched below) expresses ∇_θ J(θ) in terms of ∇_θ log π_θ and Qπ.

  • We can approximate Qπ(s,a) using the return, as we did in Monte Carlo policy gradient.
  • Alternatively, we can use the Q-value itself, learnt via a TD procedure (e.g. SARSA)!
  • But the value parameterization should be compatible with the policy parameterization.

Using a learnt value function (the critic) together with the parameterized policy (the actor) in this way is the Actor-Critic approach.
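A standard statement of the theorem (sketch; $d^{\pi_\theta}$ denotes the discounted state distribution under $\pi_\theta$):

$$ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\big] $$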

27 of 40

Actor-Critic

Policy gradient algorithms try to increase the probability of seeing trajectories that are rewarding. Below is a simple update rule which ensures this happens.
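A minimal sketch of such an update, using the Monte Carlo return $G_t$ (the vanilla REINFORCE form):

$$ \theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) $$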

Issues with the above rule:

1. High variance in the updates.
2. When the cumulative reward is 0, there is no way to learn which actions were good and which were bad.

Solution:

1. A baseline is introduced, which reduces the variance of the updates.

28 of 40

Actor-Critic

Different baselines can be considered, leading to different versions of the actor-critic algorithm; a few standard choices are sketched below.
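Some standard choices for the term that multiplies $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (the usual menu, not necessarily the slide's exact list):

$$ G_t \;(\text{REINFORCE}); \qquad G_t - b(s_t); \qquad Q^{\pi}(s_t,a_t); \qquad A^{\pi}(s_t,a_t); \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$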

29 of 40

Advantage Actor-Critic (A2C)

The most popular choice of actor-critic algorithm.

The advantage function is used to figure out whether an action is good or bad: it tells us how much better a given action is relative to the average action taken from that state.

We don’t need two separate neural networks to obtain A(s,a); see the estimate sketched below.
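A single value network suffices because the TD error is an estimate of the advantage (unbiased when $V = V^\pi$):

$$ A(s_t,a_t) = Q(s_t,a_t) - V(s_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t) $$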

30 of 40

Asynchronous Advantage Actor-Critic (A3C)

Makes asynchronous parameter updates.

A parameter server maintains a global copy of the network parameters.

Multiple workers are created, each deployed in its own copy of the environment.

The gradients calculated by a worker are sent to the parameter server, which applies them to the global network.

The updated global network weights are then sent back to the workers.

This reduces the wall-clock training time of the agent.

Are asynchronous updates really needed? No.

31 of 40

(Another look at) Advantage Actor Critic


Objective functions: the critic is trained towards an n-step truncated, corrected return (empirically, n = 20 was found to work best), while the actor's objective is the sum of rewards obtained on a trajectory; the n-step return is sketched below.
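The n-step truncated, corrected return has the standard form (the correction is the bootstrap term $\gamma^n V(s_{t+n})$):

$$ G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) $$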

32 of 40

Asynchronous Advantage Actor Critic


Objective functions: as before, an n-step truncated, corrected return is used (n = 20 found best empirically), with the advantage function weighting the actor's policy-gradient update.

33 of 40

Compare to REINFORCE


The terms annotated on the slide are the reinforcement baseline and the characteristic eligibility (see below).
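For reference, one standard form of Williams' REINFORCE update in which these two terms appear:

$$ \Delta\theta = \alpha\,\big(G_t - b(s_t)\big)\,\underbrace{\nabla_\theta \ln \pi_\theta(a_t \mid s_t)}_{\text{characteristic eligibility}}, $$

where $b(s_t)$ is the reinforcement baseline.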


36 of 40

Deterministic Policy Gradient

  • Problems with continuous action spaces:
    • Hard to implement differentiable continuous controllers in many problems
  • A deterministic policy:
    • Reduces the expectation to one over states only, greatly simplifying gradient estimation
    • Avoids the max over actions
  • Key difficulty: what is the notion of a gradient for deterministic policies?
    • Continuous action spaces allow us to think of the change in actions w.r.t. the policy parameters


37 of 40

Deterministic Policy Gradient

  • Deterministic Policy Gradient Theorem (a standard statement is sketched below)

  • Can be shown to be the limit of the stochastic policy gradient theorem as the variance of the policy goes to zero.
  • Exploration?
    • If the environment is stochastic, then this is not an issue
    • Otherwise use an off-policy actor-critic, where the behaviour policy differs from the estimation policy
    • Q-learning for the value function updates
  • Still requires compatible parameterizations
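A standard statement of the theorem for a deterministic policy $\mu_\theta$ (sketch; $\rho^{\mu}$ is the discounted state distribution under $\mu_\theta$):

$$ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a = \mu_\theta(s)}\Big] $$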


38 of 40

Deep Deterministic Policy Gradient (DDPG)

  • Combine ideas from DQN with DPG
  • Key ideas:
    • Average instead of max
    • Soft target networks
      • Weights are not frozen but very slowly track the online network (see the update sketched below)

    • Batch normalization for stability!
  • Works well for continuous action spaces
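The soft target update and the corresponding target value take the standard DDPG form (with $\tau \ll 1$; primes denote target networks, $\mu$ the deterministic actor):

$$ \theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta', \qquad y = r + \gamma\, Q'\big(s',\, \mu'(s')\big) $$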


39 of 40

Deep Deterministic Policy Gradient: Algorithm
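A minimal sketch of the core DDPG update step in PyTorch (illustrative networks and hyperparameters are my own choices, and the replay buffer is omitted; this is a sketch of the idea, not the exact algorithm from the slide):

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s) with actions squashed to [-1, 1]."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a) estimator."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005
actor, critic = Actor(obs_dim, act_dim), Critic(obs_dim, act_dim)
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    # Critic: regress towards r + gamma * Q'(s', mu'(s')) -- no max over actions.
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_targ(s2, actor_targ(s2)).squeeze(-1)
    critic_loss = ((critic(s, a).squeeze(-1) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: follow the deterministic policy gradient, i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates: target weights slowly track the online networks.
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.mul_(1 - tau).add_(tau * p)

# Example call on a random minibatch of transitions:
B = 32
ddpg_update(torch.randn(B, obs_dim), torch.rand(B, act_dim) * 2 - 1,
            torch.randn(B), torch.randn(B, obs_dim), torch.zeros(B))
```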

40 of 40

Summing up

  • Many important methods not covered
    • Soft Actor-Critic (SAC)
    • Distributional RL
    • TRPO, PPO, etc.
    • Twin Delayed DDPG (TD3)
  • Many innovations coming up (weekly?!)
    • Meta-RL
    • Deep Hierarchical RL
    • Upside down RL, decision transformers, etc.
    • Useful DNN tricks
  • Amazing applications
    • MuZero
    • AlphaFold
    • Project Loon
