1 of 40

Reinforcement Learning Tutorial

Brains, Minds, and Machines Summer Course 2018

TA: Xavier Boix & Yen-Ling Kuo

2 of 40

Introduction

Reinforcement learning (RL) is the problem of making an agent learn to take actions by interacting with its environment.

3 of 40

The RL Set-up

[Diagram: the agent–world loop. The agent sends an action a_t to the world; the world returns an observation o_t and a reward r_t.]

At each step t the agent:

  • Receives an observation of the world o_t
  • Receives a reward r_t
  • Takes an action a_t

At each step t the world:

  • Receives an action a_t
  • Emits an observation of the world o_{t+1}
  • Emits a reward r_{t+1}

(Discrete timing)

4 of 40

The State

[Diagram: agent–world loop with action a_t, observation o_t, and reward r_t, as on the previous slide.]

Experience: o_0, r_0, a_0, o_1, r_1, a_1, ...

State:

The current state is a summary of the experience so far:

s_t = f(experience) = f(o_0, r_0, a_0, ..., o_t, r_t, a_t)

In a fully observable environment, assume:

s_t = f(o_t)

5 of 40

Example RL Set-up

[Diagram: agent–video game loop. Observation o_t: a set of images (game frames). Reward r_t: the game score. Action a_t: joystick input.]

6 of 40

Example RL Set-up

[Diagram: the same agent–video game loop, with the agent implemented as a deep neural network. Observation o_t: a set of images. Reward r_t: the game score. Action a_t: joystick input.]

7 of 40

RL and Supervised Learning

[Diagram: agent–dataset loop for a Dog vs Cat dataset. Observation o_t: a new image. Action a_t: "It is a Dog/Cat". Reward r_t: classification accuracy on o_{t-1}.]

Supervised Learning:

  • Observations are emitted independently of the agent’s previous actions. Sequence length = 1.
  • Immediate feedback (no delayed reward).

8 of 40

Policy

Policy: a way of deciding which action to take given the current situation.

The policy is a map from the state to the action:

a_t = π(s_t)  (deterministic)

π(a_t | s_t) = P[a_t | s_t]  (stochastic)
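As a rough illustration (not from the slides), here is a minimal Python sketch of the two kinds of policy on a toy problem; the state/action space and the probabilities are made up.

```python
import random

# Toy discrete action set, purely for illustration.
ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a_t = pi(s_t): always returns the same action for a given state."""
    return "right" if state >= 0 else "left"

def stochastic_policy(state):
    """pi(a_t | s_t): samples an action from a state-dependent distribution."""
    p_right = 0.8 if state >= 0 else 0.2   # hypothetical probabilities
    return "right" if random.random() < p_right else "left"
```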

9 of 40

Value Function

The value function answers: “How much reward will I get from state s_t onward?”

V^π(s_t) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t ]  (discounted framework, with discount factor 0 ≤ γ ≤ 1)

10 of 40

Value Function

P(s_{t+1} | s_t, a_t): the transition probability to state s_{t+1} from s_t after taking action a_t.

Markov assumption: the next state depends only on the current state and action,

P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_1, a_1, ..., s_t, a_t)

11 of 40

Q-function

The Q-function answers: “How much reward will I get from state s_t if I take action a_t?”

Q^π(s_t, a_t) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t, a_t ]

12 of 40

Goal

[Diagram: agent–world loop with action a_t, observation o_t, and reward r_t.]

Find a policy that maximizes the expected cumulative reward.

13 of 40

Types of RL Models

14 of 40

Variants of RL

[Diagram: agent, world model, action, reward.]

Policy-based RL

Directly optimize the policy to get good return.

Value-based RL

Estimate the expected returns to choose the optimal policy.

Model-based RL

Build a model of the environment and choose actions based on the model.

15 of 40

Value-based RL

  • How to select a good policy using value functions? Alternate two steps (see the sketch after the example below):
    • Evaluate: compute the value function V^π of the current policy π.
    • Set: make the new policy greedy with respect to those values, π'(s) = argmax_a E[ r + γ V^π(s') | s, a ].

[Figure: a small example MDP. Rewards: +10, +3, -7. Transition probabilities (0.2, 0.35, 0.1, 0.35) and state values (6.6, 3.1, -1.2, 3.2, 3.41) illustrate one round of policy iteration.]
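A minimal policy-iteration sketch in Python, assuming a small tabular MDP with known transition probabilities P and rewards R; the numbers are random placeholders, not the ones from the figure.

```python
import numpy as np

# A tiny tabular MDP with made-up transitions P[s, a, s'] and rewards R[s, a].
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # shape (S, A, S')
R = rng.uniform(-7.0, 10.0, size=(n_states, n_actions))

policy = np.zeros(n_states, dtype=int)                  # start from an arbitrary policy
for _ in range(100):
    # "Evaluate": solve V = R_pi + gamma * P_pi @ V exactly for the current policy.
    P_pi = P[np.arange(n_states), policy]               # (S, S')
    R_pi = R[np.arange(n_states), policy]               # (S,)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # "Set": make the policy greedy with respect to the evaluated values.
    Q = R + gamma * P @ V                                # (S, A)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):               # fixed point reached
        break
    policy = new_policy

print("greedy policy:", policy, "state values:", np.round(V, 2))
```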

16 of 40

Policy-based RL

  • Directly parameterize the policy as π_θ(a | s) with parameters θ → Goal: find the best θ!

  • Objective to evaluate: the expected return J(θ) = E_{π_θ}[ Σ_t γ^t r_t ]

  • Compute the gradient with respect to θ and update! The policy gradient can be written as
    ∇_θ J(θ) = E_{π_θ}[ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t ], where G_t is the return from step t.

Practically, the expectation is estimated with only N trajectories sampled from π_θ.

REINFORCE algorithm
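A minimal sketch of the REINFORCE estimator in PyTorch; it assumes the log-probabilities of the chosen actions and the rewards of one sampled episode are already available as tensors (the names here are hypothetical).

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs, rewards: 1-D tensors for a single episode of length T."""
    T = rewards.shape[0]
    discounts = gamma ** torch.arange(T, dtype=torch.float32)
    # G_t = sum_{k >= t} gamma^(k - t) r_k, computed with a reversed cumulative sum.
    returns = torch.flip(torch.cumsum(torch.flip(rewards * discounts, [0]), 0), [0]) / discounts
    # Negative sign because optimizers minimize; the gradient of this loss is
    # -sum_t grad log pi_theta(a_t | s_t) * G_t, the single-sample REINFORCE estimator.
    return -(log_probs * returns.detach()).sum()

# Usage sketch: average reinforce_loss over N sampled episodes, then call
# loss.backward() and optimizer.step() to update theta.
```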

17 of 40

Model-based RL

  • When the rules of the environment are known or easy to learn, we can learn a model of it
    • This is a supervised learning problem! (See the sketch below.)
  • How to select a policy using a model?
    • Iteratively compute the value and optimal action for each state.
    • Sample experience from the model.
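A minimal sketch of the supervised-learning view, assuming a dataset of observed (state, action, next state) transitions; the network, dimensions, and hyperparameters are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Learn a transition model: predict the next state from (state, action).
state_dim, action_dim = 4, 2
model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, state_dim))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(states, actions, next_states):
    """One supervised update: regress s' from (s, a) with an MSE loss."""
    pred = model(torch.cat([states, actions], dim=-1))
    loss = nn.functional.mse_loss(pred, next_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```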

18 of 40

Exploration vs Exploitation

  • Exploitation: taking the best action given the current information.
  • Exploration: doing things you haven’t done before to collect more information.
  • Common approaches
    • ε-greedy: add noise to the greedy policy (see the sketch after the figure below).
    • Probability matching: select actions according to the probability that they are the best.

[Figure: two games compared, one easy and one hard to explore. In the hard game the agent needs to discover the meaning of the sprites and may have to sacrifice short-term returns.]
19 of 40

Sample Efficiency

  • How many samples do we need to get a good policy?
  • On-policy
    • Need to generate new samples every time the policy is changed.
    • Example: policy gradient
  • Off-policy
    • Can improve the policy without generating new samples from that policy.
    • Example: Q-learning (as used in DQN)

[Figure: sample-efficiency spectrum, from less efficient (policy gradient) to more efficient (model-based RL).]

20 of 40

Other Algorithms

  • Combining different types of RL algorithms
    • Example: Actor-Critic
      • The critic estimates the values of the current policy
      • The actor updates the policy in the direction that improves the value function
  • Approximating these functions with deep networks
    • Example: Deep Q-Network

21 of 40

Deep Q-Network

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529–533.

22 of 40

Recap: Q-Function

Q^π(s_t, a_t) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t, a_t ]: the expected discounted return obtained by taking action a_t in state s_t and following policy π afterwards.

23 of 40

Optimal Value Functions

An optimal value function is the maximum achievable value:

Q*(s, a) = max_π Q^π(s, a)

The agent can act optimally by acting greedily with respect to the optimal value function:

π*(s) = argmax_a Q*(s, a)

24 of 40

The Bellman Equation

25 of 40

The Bellman Equation

26 of 40

The Bellman Equation

27 of 40

The Bellman Equation

Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)
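The equations on these Bellman-equation slides did not survive extraction; the following LaTeX reconstructs the standard one-step decomposition they build up to, consistent with the definitions of V^π and Q^π on the earlier slides.

```latex
% Bellman expectation equation: unroll one step of the discounted return.
\begin{aligned}
Q^{\pi}(s_t, a_t)
  &= \mathbb{E}_{\pi}\!\left[\, r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots \mid s_t, a_t \,\right] \\
  &= \mathbb{E}_{\pi}\!\left[\, r_t + \gamma \, Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t, a_t \,\right].
\end{aligned}

% For the optimal Q-function the next action is chosen greedily, giving the
% Bellman optimality equation that Q-learning and DQN are built on:
Q^{*}(s_t, a_t)
  = \mathbb{E}\!\left[\, r_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t, a_t \,\right].
```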

28 of 40

Deep Q-Network

Represent the value function by a deep neural network with weights w:

Q(s, a; w) ≈ Q*(s, a)
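A minimal PyTorch sketch of such a network (layer sizes are placeholders, not the convolutional architecture from the Nature paper): the input is a state and the output is one Q-value per action, so Q(s, a; w) is entry a of the output vector.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),   # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)           # shape: (batch, n_actions)
```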

29 of 40

Deep Q-Network

[Figure courtesy of Tomotake Sasaki, Fujitsu]

30 of 40

Deep Q-Network

[Diagram: the network takes video game frames as input and outputs a Q-value for each possible action.]

31 of 40

Learning

Learning goal is to minimize the following squared error, where r_t is the received reward:

( r_t + γ max_{a'} Q(s_{t+1}, a'; w) - Q(s_t, a_t; w) )²

C. Watkins (1989) proved that this update rule converges to Q* (in the tabular case).

32 of 40

Learning

Learning goal is to minimize the squared temporal-difference (TD) error, where r_t is the received reward:

( r_t + γ max_{a'} Q(s_{t+1}, a'; w) - Q(s_t, a_t; w) )²

The term inside the parentheses is the TD error.

C. Watkins (1989) proved that this update rule converges to Q* (in the tabular case).
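A minimal sketch of the corresponding tabular Q-learning update for a single transition (s, a, r, s'); Q is assumed to be a NumPy array of shape (n_states, n_actions) and alpha a step size.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Move Q[s, a] toward the TD target r + gamma * max_a' Q[s_next, a']."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]      # the temporal-difference error
    Q[s, a] += alpha * td_error         # gradient step on the squared TD error
    return td_error
```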

33 of 40

Learning

Goal: minimize the expected squared TD error over experienced transitions,

L(w) = E[ ( r_t + γ max_{a'} Q(s_{t+1}, a'; w) - Q(s_t, a_t; w) )² ],

by stochastic gradient descent on the weights w.

34 of 40

Learning

Goal: minimize the same loss, but compute the target with a separate target network with weights w-,

L(w) = E[ ( r_t + γ max_{a'} Q(s_{t+1}, a'; w-) - Q(s_t, a_t; w) )² ]

Update the target network (w- ← w) after backpropagating a batch of sequences rather than for every sequence (more stable).
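A minimal PyTorch sketch of this loss with a separate target network; q_net and target_net are assumed to be two copies of a Q-network (e.g. the one sketched earlier), and the batch tensors are assumed to come from the replay memory described on the next slide.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """actions: int64 tensor of shape (B,); dones: 1.0 where the episode ended, else 0.0."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; w)
    with torch.no_grad():                                             # no gradient through w-
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)

# After a batch of updates (rather than every transition), copy the online
# weights into the target network:
#   target_net.load_state_dict(q_net.state_dict())
```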

35 of 40

Learning

To have a varied batch of sequences to update the network (experience replay):

  • A memory stores past sequences.
  • During training, a subset of those sequences is randomly chosen.
  • The chosen sequences are used to train the network.

Actions during training are selected with the ε-greedy policy.
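A minimal sketch of such a replay memory in Python; the capacity and batch size are arbitrary choices.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)  # uniform random subset
```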

36 of 40

Demo: DQN for Pong!

  • PyTorch example on Cart Pole: Colab link
  • Your exercise
    • Try MountainCar-v0 to bring the mountain car uphill!

37 of 40

Discussions

38 of 40

Successful Cases

  • Work very well in domains governed by simple or known rules
  • Learn simple skills with raw sensory inputs, given enough experience
    • Example: OpenAI dexterity, trained with 6144 CPU cores and 8 GPUs, collecting ~100 years of experience

39 of 40

Challenges

  • Humans can learn incredibly quickly
    • Deep RL methods are usually slow and need more data
  • Humans can reuse past knowledge
  • Transfer learning in deep RL
    • Transfer across problem instances
    • Transfer from simulations
  • How to define reward functions
  • Composition of tasks
  • Safety of the learned policy

40 of 40

Thanks & Questions?

  • DQN Exercise
    • https://goo.gl/okaj2w
  • Link to the slides
    • https://goo.gl/3FH7oC
  • Let us know if you have any feedback!
    • Xavier: xboix@mit.edu
    • Yen-Ling: ylkuo@mit.edu