1 of 40

Reinforcement Learning Tutorial

Brains, Minds, and Machines Summer Course 2018

TA: Xavier Boix & Yen-Ling Kuo

2 of 40

Introduction

Reinforcement learning (RL) is the problem of making an agent learn to take actions by interacting with its environment.

3 of 40

The RL Set-up

[Diagram: the agent–world loop. The agent sends an action a_t to the world; the world returns an observation o_t and a reward r_t.]

At each step t the agent:

  • Receives an observation of the world o_t
  • Receives a reward r_t
  • Takes an action a_t

At each step t the world:

  • Receives an action a_t
  • Emits an observation of the world o_{t+1}
  • Emits a reward r_{t+1}

(Discrete timing)

4 of 40

The State

[Diagram: agent–world loop with action a_t, observation o_t, and reward r_t, as on the previous slide.]

Experience: o_0, r_0, a_0, o_1, r_1, a_1, ...

State:

The current state is a summary of the experience so far:

s_t = f(experience) = f(o_0, r_0, a_0, ..., o_t, r_t, a_t)

In a fully observable environment, assume:

s_t = f(o_t)

5 of 40

Example RL Set-up

[Diagram: agent–video game loop. Observation o_t: a set of images (game frames). Reward r_t: the game score. Action a_t: joystick input.]

6 of 40

Example RL Set-up

[Diagram: the same agent–video game loop, with the agent implemented as a deep neural network. Observation o_t: a set of images. Reward r_t: the game score. Action a_t: joystick input.]

7 of 40

RL and Supervised Learning

[Diagram: agent–dataset loop for a Dog vs Cat dataset. Observation o_t: a new image. Action a_t: "It is a Dog/Cat". Reward r_t: classification accuracy on o_{t-1}.]

Supervised Learning:

  • Observations are emitted independently of the agent’s previous actions. Sequence length = 1.
  • Immediate feedback (no delayed reward).

8 of 40

Policy

Policy: a way of deciding which action to take given the current situation.

The policy is a map from the state to the action:

a_t = π(s_t)  (deterministic)

π(a_t | s_t) = P[a_t | s_t]  (stochastic)
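As a rough illustration (not from the slides), here is a minimal Python sketch of the two kinds of policy on a toy problem; the state/action space and the probabilities are made up.

```python
import random

# Toy discrete action set, purely for illustration.
ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a_t = pi(s_t): always returns the same action for a given state."""
    return "right" if state >= 0 else "left"

def stochastic_policy(state):
    """pi(a_t | s_t): samples an action from a state-dependent distribution."""
    p_right = 0.8 if state >= 0 else 0.2   # hypothetical probabilities
    return "right" if random.random() < p_right else "left"
```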

9 of 40

Value Function

The value function answers: “How much reward will I get from state s_t onward?”

V^π(s_t) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t ]  (discounted framework, with discount factor 0 ≤ γ ≤ 1)

10 of 40

Value Function

P(s_{t+1} | s_t, a_t): the transition probability to state s_{t+1} from s_t after taking action a_t.

Markov assumption: the next state depends only on the current state and action,

P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_1, a_1, ..., s_t, a_t)

11 of 40

Q-function

The Q-function answers: “How much reward will I get from state s_t if I take action a_t?”

Q^π(s_t, a_t) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t, a_t ]

12 of 40

Goal

[Diagram: agent–world loop with action a_t, observation o_t, and reward r_t.]

Find a policy that maximizes the expected cumulative reward.

13 of 40

Types of RL Models

14 of 40

Variants of RL

[Diagram: agent, world model, action, reward.]

Policy-based RL

Directly optimize the policy to get good return.

Value-based RL

Estimate the expected returns to choose the optimal policy.

Model-based RL

Build a model of the environment and choose actions based on the model.

15 of 40

Value-based RL

  • How to select a good policy using value functions? Alternate two steps (see the sketch after the example below):
    • Evaluate: compute the value function V^π of the current policy π.
    • Set: make the new policy greedy with respect to those values, π'(s) = argmax_a E[ r + γ V^π(s') | s, a ].

[Figure: a small example MDP. Rewards: +10, +3, -7. Transition probabilities (0.2, 0.35, 0.1, 0.35) and state values (6.6, 3.1, -1.2, 3.2, 3.41) illustrate one round of policy iteration.]
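A minimal policy-iteration sketch in Python, assuming a small tabular MDP with known transition probabilities P and rewards R; the numbers are random placeholders, not the ones from the figure.

```python
import numpy as np

# A tiny tabular MDP with made-up transitions P[s, a, s'] and rewards R[s, a].
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # shape (S, A, S')
R = rng.uniform(-7.0, 10.0, size=(n_states, n_actions))

policy = np.zeros(n_states, dtype=int)                  # start from an arbitrary policy
for _ in range(100):
    # "Evaluate": solve V = R_pi + gamma * P_pi @ V exactly for the current policy.
    P_pi = P[np.arange(n_states), policy]               # (S, S')
    R_pi = R[np.arange(n_states), policy]               # (S,)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # "Set": make the policy greedy with respect to the evaluated values.
    Q = R + gamma * P @ V                                # (S, A)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):               # fixed point reached
        break
    policy = new_policy

print("greedy policy:", policy, "state values:", np.round(V, 2))
```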

16 of 40

Policy-based RL

  • Directly parameterize the policy as π_θ(a | s) with parameters θ → Goal: find the best θ!

  • Objective to evaluate: the expected return J(θ) = E_{π_θ}[ Σ_t γ^t r_t ]

  • Compute the gradient with respect to θ and update! The policy gradient can be written as
    ∇_θ J(θ) = E_{π_θ}[ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t ], where G_t is the return from step t.

Practically, the expectation is estimated with only N trajectories sampled from π_θ.

REINFORCE algorithm
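A minimal sketch of the REINFORCE estimator in PyTorch; it assumes the log-probabilities of the chosen actions and the rewards of one sampled episode are already available as tensors (the names here are hypothetical).

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs, rewards: 1-D tensors for a single episode of length T."""
    T = rewards.shape[0]
    discounts = gamma ** torch.arange(T, dtype=torch.float32)
    # G_t = sum_{k >= t} gamma^(k - t) r_k, computed with a reversed cumulative sum.
    returns = torch.flip(torch.cumsum(torch.flip(rewards * discounts, [0]), 0), [0]) / discounts
    # Negative sign because optimizers minimize; the gradient of this loss is
    # -sum_t grad log pi_theta(a_t | s_t) * G_t, the single-sample REINFORCE estimator.
    return -(log_probs * returns.detach()).sum()

# Usage sketch: average reinforce_loss over N sampled episodes, then call
# loss.backward() and optimizer.step() to update theta.
```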

17 of 40

Model-based RL

  • When the rules of the environment are known or easy to learn, we can learn a model of it
    • This is a supervised learning problem! (See the sketch below.)
  • How to select a policy using a model?
    • Iteratively compute the value and optimal action for each state.
    • Sample experience from the model.
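A minimal sketch of the supervised-learning view, assuming a dataset of observed (state, action, next state) transitions; the network, dimensions, and hyperparameters are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Learn a transition model: predict the next state from (state, action).
state_dim, action_dim = 4, 2
model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, state_dim))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(states, actions, next_states):
    """One supervised update: regress s' from (s, a) with an MSE loss."""
    pred = model(torch.cat([states, actions], dim=-1))
    loss = nn.functional.mse_loss(pred, next_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```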

18 of 40

Exploration vs Exploitation

  • Exploitation: taking the best action given the current information.
  • Exploration: doing things you haven’t done before to collect more information.
  • Common approaches
    • ε-greedy: add noise to the greedy policy (see the sketch after the figure below).
    • Probability matching: select actions according to the probability that they are the best.

[Figure: two games compared, one easy and one hard to explore. In the hard game the agent needs to discover the meaning of the sprites and may have to sacrifice short-term returns.]
19 of 40

Sample Efficiency

  • How many samples do we need to get a good policy?
  • On-policy
    • Need to generate new samples every time the policy is changed.
    • Example: policy gradient
  • Off-policy
    • Can improve the policy without generating new samples from that policy.
    • Example: Q-learning (as used in DQN)

[Figure: sample-efficiency spectrum, from less efficient (policy gradient) to more efficient (model-based RL).]

20 of 40

Other Algorithms

  • Combining different types of RL algorithms
    • Example: Actor-Critic
      • The critic estimates the values of the current policy
      • The actor updates the policy in the direction that improves the value function
  • Approximating these functions with deep networks
    • Example: Deep Q-Network

21 of 40

Deep Q-Network

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529–533.

22 of 40

Recap: Q-Function

Q^π(s_t, a_t) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t, a_t ]: the expected discounted return obtained by taking action a_t in state s_t and following policy π afterwards.

23 of 40

Optimal Value Functions

An optimal value function is the maximum achievable value:

Q*(s, a) = max_π Q^π(s, a)

The agent can act optimally by acting greedily with respect to the optimal value function:

π*(s) = argmax_a Q*(s, a)

24 of 40

The Bellman Equation

25 of 40

The Bellman Equation

26 of 40

The Bellman Equation

27 of 40

The Bellman Equation

Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)
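The equations on these Bellman-equation slides did not survive extraction; the following LaTeX reconstructs the standard one-step decomposition they build up to, consistent with the definitions of V^π and Q^π on the earlier slides.

```latex
% Bellman expectation equation: unroll one step of the discounted return.
\begin{aligned}
Q^{\pi}(s_t, a_t)
  &= \mathbb{E}_{\pi}\!\left[\, r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots \mid s_t, a_t \,\right] \\
  &= \mathbb{E}_{\pi}\!\left[\, r_t + \gamma \, Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t, a_t \,\right].
\end{aligned}

% For the optimal Q-function the next action is chosen greedily, giving the
% Bellman optimality equation that Q-learning and DQN are built on:
Q^{*}(s_t, a_t)
  = \mathbb{E}\!\left[\, r_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t, a_t \,\right].
```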

28 of 40

Deep Q-Network

Represent the value function by a deep neural network with weights w:

Q(s, a; w) ≈ Q*(s, a)
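A minimal PyTorch sketch of such a network (layer sizes are placeholders, not the convolutional architecture from the Nature paper): the input is a state and the output is one Q-value per action, so Q(s, a; w) is entry a of the output vector.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),   # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)           # shape: (batch, n_actions)
```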

29 of 40

Deep Q-Network

[Figure courtesy of Tomotake Sasaki, Fujitsu]

30 of 40

Deep Q-Network

[Diagram: the network takes video game frames as input and outputs a Q-value for each possible action.]

31 of 40

Learning

Learning goal is to minimize the following squared error, where r_t is the received reward:

( r_t + γ max_{a'} Q(s_{t+1}, a'; w) - Q(s_t, a_t; w) )²

C. Watkins (1989) proved that this update rule converges to Q* (in the tabular case).

32 of 40

Learning

Learning goal is to minimize the squared temporal-difference (TD) error, where r_t is the received reward:

( r_t + γ max_{a'} Q(s_{t+1}, a'; w) - Q(s_t, a_t; w) )²

The term inside the parentheses is the TD error.

C. Watkins (1989) proved that this update rule converges to Q* (in the tabular case).
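A minimal sketch of the corresponding tabular Q-learning update for a single transition (s, a, r, s'); Q is assumed to be a NumPy array of shape (n_states, n_actions) and alpha a step size.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Move Q[s, a] toward the TD target r + gamma * max_a' Q[s_next, a']."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]      # the temporal-difference error
    Q[s, a] += alpha * td_error         # gradient step on the squared TD error
    return td_error
```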

33 of 40

Learning

Goal: minimize the expected squared TD error over experienced transitions,

L(w) = E[ ( r_t + γ max_{a'} Q(s_{t+1}, a'; w) - Q(s_t, a_t; w) )² ],

by stochastic gradient descent on the weights w.

34 of 40

Learning

Goal: minimize the same loss, but compute the target with a separate target network with weights w-,

L(w) = E[ ( r_t + γ max_{a'} Q(s_{t+1}, a'; w-) - Q(s_t, a_t; w) )² ]

Update the target network (w- ← w) after backpropagating a batch of sequences rather than for every sequence (more stable).
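A minimal PyTorch sketch of this loss with a separate target network; q_net and target_net are assumed to be two copies of a Q-network (e.g. the one sketched earlier), and the batch tensors are assumed to come from the replay memory described on the next slide.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """actions: int64 tensor of shape (B,); dones: 1.0 where the episode ended, else 0.0."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; w)
    with torch.no_grad():                                             # no gradient through w-
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)

# After a batch of updates (rather than every transition), copy the online
# weights into the target network:
#   target_net.load_state_dict(q_net.state_dict())
```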

35 of 40

Learning

To have a varied batch of sequences to update the network (experience replay):

  • A memory stores past sequences.
  • During training, a subset of those sequences is randomly chosen.
  • The chosen sequences are used to train the network.

Actions during training are selected with the ε-greedy policy.
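A minimal sketch of such a replay memory in Python; the capacity and batch size are arbitrary choices.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)  # uniform random subset
```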

36 of 40

Demo: DQN for Pong!

  • PyTorch example on Cart Pole: Colab link
  • Your exercise
    • Try MountainCar-v0 to bring the mountain car uphill!

37 of 40

Discussions

38 of 40

Successful Cases

  • Work very well in domains governed by simple or known rules
  • Learn simple skills with raw sensory inputs, given enough experience
    • Example: OpenAI dexterity, trained with 6144 CPU cores and 8 GPUs, collecting ~100 years of experience

39 of 40

Challenges

  • Humans can learn incredibly quickly
    • Deep RL methods are usually slow and need more data
  • Humans can reuse past knowledge
  • Transfer learning in deep RL
    • Transfer across problem instances
    • Transfer from simulations
  • How to define reward functions
  • Composition of tasks
  • Safety of the learned policy

40 of 40

Thanks & Questions?

  • DQN Exercise
    • https://goo.gl/okaj2w
  • Link to the slides
    • https://goo.gl/3FH7oC
  • Let us know if you have any feedback!
    • Xavier: xboix@mit.edu
    • Yen-Ling: ylkuo@mit.edu