
Survey of Modern Reinforcement Learning

Julia Maddalena


What to expect from this talk

Part 1: Introduce the foundations of reinforcement learning

  • Definitions and basic ideas
  • A couple algorithms that work in simple environments

Part 2: Review some state-of-the-art methods

  • Higher-level concepts, vanilla methods
  • Not a complete list of cutting-edge methods

Part 3: Current state of reinforcement learning


Part 1

Foundations of reinforcement learning


What is reinforcement learning?

A type of machine learning where an agent interacts with an environment and learns to take actions that result in greater cumulative reward.


Definitions

Reward

Motivation for the agent. It is not always obvious what the reward signal should be, e.g. "YOU WIN!" +1, "GAME OVER" -1, staying alive +1/second (sort of).

Agent

The learner and decision maker

Environment

Everything external to the agent that the agent uses to make decisions

Actions

The set of possible steps the agent can take depending on the state of the environment


The Problem with Rewards...

Designing reward functions is notoriously difficult (Clark, Jack. "Faulty Reward Functions in the Wild." OpenAI, 21 Dec. 2016).

Possible reward structure:

  • Total points
  • Time to finish
  • Finishing position

[Video stills: human player vs. resulting RL agent]

"I've taken to imagining deep RL as a demon that's deliberately misinterpreting your reward and actively searching for the laziest possible local optima."¹

- Alex Irpan

¹ Irpan, Alex. "Deep Reinforcement Learning Doesn't Work Yet." Sorta Insightful, 14 Feb. 2018.


More Definitions

Return

Long-term, discounted reward:

  G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... , where γ is the discount factor (0 ≤ γ ≤ 1)

Value

Expected return

  value of states → V(s) = E[G_t | S_t = s]
  how good is it to be in state s

  value of state-action pairs → Q(s,a) = E[G_t | S_t = s, A_t = a]
  how good is it to take action a from state s

Policy

How the agent should act from a given state: π(a|s), the probability of taking action a in state s
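To make the return concrete, here is a minimal Python sketch (not from the slides) that computes the discounted return of a finished episode; the reward list and γ = 0.9 are illustrative assumptions.

def discounted_return(rewards, gamma=0.9):
    """Return G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the end of the episode backwards
        g = r + gamma * g
    return g

print(discounted_return([+2, -1, +10], gamma=0.9))  # 2 + 0.9*(-1) + 0.81*10 = 9.2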


Markov Decision Process

Markov Process

A random process whose future behavior only depends on the current state.

[Diagram: a Markov process with states Sleepy, Energetic, and Hungry and transition probabilities (e.g. 70%, 50%, 35%, 15%) between them.]


Markov Decision Process

Markov Process + Actions + Rewards = Markov Decision Process

[Diagram: the same Sleepy / Energetic / Hungry process, now with actions (nap, beg, be good), transition probabilities for each action, and rewards (from -6 to +10) on each transition.]


To model or not to model

Model-based methods

  • Transition model (planning): we already know the dynamics of the environment; we simply need to plan our actions to optimize reward.
  • Sample model (planning and learning): we don't know the dynamics; we try to learn them by exploring the environment and use them to plan our actions to optimize reward.

Model-free methods

  • Learning only: we don't know or care about the dynamics; we just want to learn a good policy by exploring the environment.


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model
  Model-free


Bellman Equations

Value of each state under the optimal policy for Robodog: [diagram]

Bellman Equation (value of the current state in terms of the policy, the transition probabilities, the reward, the discount factor, and the value of the next state):

  V_π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ V_π(s') ]

Bellman Optimality Equation:

  V_*(s) = max_a Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ V_*(s') ]


Policy Iteration

Policy evaluation

Make the value function consistent with the current policy.

Policy improvement

Make the policy greedy with respect to the current value function.

[Diagram: starting from a policy that naps when sleepy and chooses 50% beg / 50% be good when energetic or hungry, evaluation gives the first set of values; improvement then makes the policy deterministic (greedy), and re-evaluation gives the second set.]

  state     | value (initial policy) | value (improved policy)
  sleepy    | 19.88                  | 29.66
  energetic | 20.97                  | 31.66
  hungry    | 20.63                  | 31.90

Converge to the optimal policy and the value function under the optimal policy.
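Below is a minimal policy-iteration sketch in Python, in the spirit of the Robodog example. The transition probabilities and rewards in P are illustrative assumptions, not the exact numbers from the diagram, and GAMMA = 0.9 is likewise assumed.

import random

GAMMA = 0.9
# P[state][action] = list of (probability, next_state, reward) -- assumed numbers
P = {
    "sleepy":    {"nap":     [(0.7, "energetic", +2), (0.3, "sleepy", -1)]},
    "energetic": {"beg":     [(0.6, "hungry", -2), (0.4, "energetic", -1)],
                  "be good": [(0.5, "hungry", +10), (0.5, "energetic", +7)]},
    "hungry":    {"beg":     [(0.6, "sleepy", +10), (0.4, "hungry", +5)],
                  "be good": [(0.6, "sleepy", -6), (0.4, "hungry", -4)]},
}

def evaluate(policy, V, sweeps=100):
    """Policy evaluation: make V consistent with the current (deterministic) policy."""
    for _ in range(sweeps):
        for s in P:
            a = policy[s]
            V[s] = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
    return V

def improve(V):
    """Policy improvement: act greedily with respect to the current value function."""
    return {s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
            for s in P}

V = {s: 0.0 for s in P}
policy = {s: random.choice(list(P[s])) for s in P}
for _ in range(10):                      # alternate evaluation and improvement
    V = evaluate(policy, V)
    policy = improve(V)
print(policy, {s: round(v, 2) for s, v in V.items()})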


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning (off-policy)
    Policy-based
  Update style: Monte Carlo vs. Temporal Difference


When learning happens

Monte Carlo: wait until end of episode before making updates to value estimates

[Diagram: a full game of tic-tac-toe is played out; only once the episode ends is the value updated for all states visited in the episode.]

Temporal difference, TD(0): update every step using estimates of the next state's value (bootstrapping).

[Diagram: after every move, the value of the previous state is updated immediately.]

In this example, learning = updating the value of states.


Exploration vs exploitation

  • A greedy policy will get stuck in local optima early on.
  • We need to explore to be able to learn, but we need to exploit in order to gather the most reward.
  • With value-based methods, we need to derive the policy from the value estimates.

𝜀-greedy policy: with probability 𝜀 take a random action; otherwise take the action with the highest estimated value.
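A minimal sketch of an 𝜀-greedy policy derived from a Q table; Q is assumed to be a dict mapping (state, action) pairs to estimated values.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit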


Sarsa

Initialize Q(s,a).

For each episode:
  • Start in a random state, S.
  • Choose action A from S using the 𝛆-greedy policy derived from Q(s,a).
  • While S is not terminal:
      1. Take action A, observe reward R and new state S'.
      2. Choose action A' from S' using the 𝛆-greedy policy derived from Q(s,a).
      3. Update Q for state S and action A:
         Q(S, A) ← Q(S, A) + α [ R + γ Q(S', A') - Q(S, A) ]
      4. S ← S', A ← A'

Example transition (S, A, R, S', A') = (Energetic, be good, +5, Hungry, beg), starting from a Q table initialized to zero:

  S         | A       | Q(S, A)
  sleepy    | nap     | 0
  energetic | beg     | 0
  energetic | be good | 0
  hungry    | beg     | 0
  hungry    | be good | 0
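A sketch of tabular Sarsa following the steps above, assuming an environment object with reset() returning a state and step(action) returning (next_state, reward, done); that interface and the hyperparameter values are assumptions for illustration.

import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)]

    def policy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)
            # on-policy TD update: bootstrap from the action actually chosen next
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)])
            s, a = s2, a2
    return Q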


Q-Learning

Initialize Q(s,a).

For each episode:
  • Start in a random state, S.
  • While S is not terminal:
      1. Choose action A from S using the 𝛆-greedy policy derived from Q(s,a).
      2. Take action A, observe reward R and new state S'.
      3. Update Q for state S and action A:
         Q(S, A) ← Q(S, A) + α [ R + γ max_a' Q(S', a') - Q(S, A) ]
      4. S ← S'

Example transition (S, A, R, S') = (Energetic, be good, +5, Hungry), starting from a Q table initialized to zero. The update bootstraps from the best action available in S' (e.g. if Q(Hungry, be good) = -1 and Q(Hungry, beg) = 2, it uses 2), regardless of which action (here, beg) is actually taken next.

  S         | A       | Q(S, A)
  sleepy    | nap     | 0
  energetic | beg     | 0
  energetic | be good | 0
  hungry    | beg     | 0
  hungry    | be good | 0
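The same sketch adapted to Q-learning, with the same assumed environment interface as the Sarsa sketch above; the only change is that the update bootstraps from the best next action rather than the action actually chosen next.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, x)] for x in actions)
            # off-policy TD update: bootstrap from the best next action, not the one taken
            Q[(s, a)] += alpha * (r + gamma * best_next * (not done) - Q[(s, a)])
            s = s2
    return Q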


Part 2

State-of-the-art methods


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model: Dyna-Q, Monte Carlo Tree Search
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning (off-policy)
    Policy-based
  Update style: Monte Carlo vs. Temporal Difference


Dyna-Q

Initialize Q(s,a) and Model(s,a).

For each episode:
  • Start in a random state, S.
  • While S is not terminal:
      1. Choose action A from S using the 𝛆-greedy policy derived from Q(s,a).
      2. Take action A, observe reward R and new state S'.
      3. Update Q for state S and action A (ordinary Q-learning):
         Q(S, A) ← Q(S, A) + α [ R + γ max_a' Q(S', a') - Q(S, A) ]
      4. Update Model for state S and action A:
         Model(S, A) ← (R, S')
      5. "Hallucinate" n transitions from the Model and use them to update Q with the same rule.

Example: both tables start empty,

  S         | A       | Q(S, A) | Model: R | Model: S'
  sleepy    | nap     | 0       | 0        | NA
  energetic | beg     | 0       | 0        | NA
  energetic | be good | 0       | 0        | NA
  hungry    | beg     | 0       | 0        | NA
  hungry    | be good | 0       | 0        | NA

After observing (Energetic, be good, +5, Hungry) and (Hungry, beg, -6, Sleepy), the Model stores those rewards and next states, and repeated hallucinated replays of the first transition keep pushing Q(Energetic, be good) upward (0.5 after the real update, then 0.95, 1.355, 1.7195 with α = 0.1).

Dyna-Q


Monte Carlo Tree Search

Used when brute-force expansion of the entire game tree (minimax) is not feasible:

  Tic-tac-toe: 5,478 states
  Go: ~10^170 states


Monte Carlo Tree Search

[Diagram: one MCTS iteration on a tic-tac-toe position, with win/visit counts such as 0/0, 1/1, 0/1 on the tree nodes.]

Selection: choose a starting node.

Expansion: look at unexplored child nodes.

Simulation: simulate playing out the game under the rollout policy.

Update: update stats.


Monte Carlo Tree Search

[Diagram: a later MCTS iteration; parent stats accumulate to values such as 2/2, 2/3, 3/4.]

Selection: choose a child node to explore (based on how good its stats are and how much it has been ignored).

Expansion: look at unexplored child nodes.

Simulation: simulate playing out the game under the rollout policy.

Update: update stats, including those of parent nodes.


Monte Carlo Tree Search

[Diagram: another MCTS iteration, deeper in the tree.]

Selection: choose a child node to explore (based on how good its stats are and how much it has been ignored).

Expansion: look at unexplored child nodes (only applies if the child node has not been expanded before).

Simulation: simulate playing out the game under the rollout policy.

Update: update stats, including those of parent nodes.


Monte Carlo Tree Search

[Diagram: after many iterations, the root position has stats 47/65 and its two candidate moves have stats 32/37 and 12/33.]
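The Selection rule ("how good its stats are and how much it has been ignored") is commonly implemented with UCB1/UCT; a minimal sketch, where the exploration constant c = 1.4 is an assumed value.

import math

def ucb1_select(children, c=1.4):
    """children: list of (wins, visits) per child; returns the index to explore next."""
    total_visits = sum(v for _, v in children)
    def score(stats):
        wins, visits = stats
        if visits == 0:
            return float("inf")                  # always try unvisited children first
        # exploitation (win rate) + exploration (bonus for rarely visited children)
        return wins / visits + c * math.sqrt(math.log(total_visits) / visits)
    return max(range(len(children)), key=lambda i: score(children[i]))

print(ucb1_select([(32, 37), (12, 33), (0, 0)]))  # the unvisited child is picked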


Deep Reinforcement Learning

Instead of a lookup table with one row per state-action pair, use a function approximator: a "black box" that takes the state s as input and outputs Q(s, a) for each action a.

[Diagram: a tabular Q function for states 1-8 and actions X, Y, Z (24 entries, Q(1, X) through Q(8, Z)) next to a neural network that maps state s to outputs Q(s, X), Q(s, Y), Q(s, Z).]
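A minimal numpy sketch of that "black box": a small network mapping a state vector to one Q-value per action instead of storing a table row per (state, action) pair. The layer sizes, one-hot state encoding, and random weights are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 8, 16, 3          # e.g. states 1-8, actions X, Y, Z
W1 = rng.normal(scale=0.1, size=(n_state, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_actions))

def q_values(state_vec):
    """Forward pass: state -> [Q(s, X), Q(s, Y), Q(s, Z)]."""
    h = np.maximum(0.0, state_vec @ W1)          # ReLU hidden layer
    return h @ W2

s = np.zeros(n_state); s[0] = 1.0                # one-hot encoding of state 1
print(q_values(s))                               # one Q estimate per action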


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model: Dyna-Q, Monte Carlo Tree Search
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning, Deep Q Networks* (off-policy)
    Policy-based
  Update style: Monte Carlo Methods vs. Temporal Difference Methods

* Utilize deep learning


Deep Q Networks (DQN)

The "black box" network takes the state s (e.g. the board encoded as X / blank / O per square) and outputs Q(s, a) for each action a: how good is it to take this action from this state?

  1. Initialize the network.
  2. Take one action under the Q policy.
  3. Add the new transition to the training data:

       step | s  | a  | r  | s'
       1    | s1 | a1 | r1 | s2
       2    | s2 | a2 | r2 | s3
       ...  | ...| ...| ...| ...
       t    | st | at | rt | st+1

  4. Use stochastic gradient descent to update the weights based on the error between the prediction ŷ = Q(s, a) and the target y = r + γ max_a' Q(s', a').

Repeat steps 2-4 until convergence.


Deep Q Networks (DQN)

  1. Initialize the network.
  2. Take one action under the Q policy.
  3. Add the new transition to the training data.
  4. Use stochastic gradient descent to update the weights based on the error between ŷ = Q(s, a) and y = r + γ max_a' Q(s', a').

Problem:
  • Data are not i.i.d.
  • Data are collected under an evolving policy, not the optimal policy that we are trying to learn.

Solution: create a replay buffer of the last k transitions and train on small random samples from it.

Problem: instability is introduced when updating Q(s, a) using Q(s', a') from the same network.

Solution: have a secondary target network that is used to evaluate Q(s', a'), and only sync it with the primary network after every n training iterations.

[Diagram: the primary network trains on samples from the replay buffer (transitions t-k through t); the target network is a periodically synced copy.]
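A sketch of the two fixes above: a bounded replay buffer sampled at random, and a target network synced only every n training iterations. The get_weights()/set_weights() methods on the network objects are an assumed interface, not a specific library's API.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions fall off the end

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def maybe_sync_target(primary_net, target_net, step, sync_every=1000):
    """Copy the primary network's weights into the target network every sync_every steps."""
    if step % sync_every == 0:
        target_net.set_weights(primary_net.get_weights())  # assumed network interface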


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model: Dyna-Q, Monte Carlo Tree Search
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning, Deep Q Networks* (off-policy)
    Policy-based: REINFORCE*
  Update style: Monte Carlo Methods vs. Temporal Difference Methods

* Utilize deep learning


REINFORCE

The "black box" network takes the state s and outputs 𝛑(a|s) for each action a: what is the probability of taking this action under policy 𝛑?

  1. Initialize the network.
  2. Play out a full episode under 𝛑, collecting rewards r1, r2, ..., rT.
  3. For every step t, calculate the return from that state until the end of the episode:
       G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
  4. Use stochastic gradient descent to update the weights based on the policy-gradient loss, -G_t log 𝛑(a_t|s_t).

Repeat steps 2-4 until convergence.
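A sketch of the REINFORCE targets: the return-to-go for every step of an episode and the policy-gradient loss -G_t log 𝛑(a_t|s_t) that the weights would be updated on (the gradient step itself is left to whatever network library is used); γ = 0.99 is an assumed value.

def returns_to_go(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... for every step t of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs[t] = log pi(a_t | s_t) under the current policy network."""
    Gs = returns_to_go(rewards, gamma)
    return -sum(G * lp for G, lp in zip(Gs, log_probs))

print(returns_to_go([0, 0, 1], gamma=0.5))       # [0.25, 0.5, 1.0]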


DQN vs REINFORCE

                | DQN                    | REINFORCE
  Learning      | Off-policy             | On-policy
  Updates       | Temporal difference    | Monte Carlo
  Output        | Q(s,a) → value-based   | 𝛑(a|s) → policy-based
  Action spaces | Small, discrete only   | Large discrete or continuous
  Exploration   | 𝛆-greedy               | Built in, thanks to the stochastic policy
  Convergence   | Slower to converge     | Faster to converge
  Experience    | Less experience needed | More experience needed


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model: Dyna-Q, Monte Carlo Tree Search
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning, Deep Q Networks* (off-policy)
    Policy-based: REINFORCE*, Advantage Actor-Critic*
  Update style: Monte Carlo Methods vs. Temporal Difference Methods

* Utilize deep learning


Q Actor-Critic

[Diagram: one network with common layers takes the state s (the board encoded as X / blank / O), then splits into a policy net that outputs 𝛑(a|s) for each action a and a value net that outputs Q(s,a) for each action a.]

Actor

Policy-based, like REINFORCE, but can now use temporal difference learning.

Critic

Value-based; works somewhat like DQN.


Quick review

  Q-Learning → DQN: ability to generalize values across the state space
  DQN → REINFORCE: ability to control in continuous action spaces using a stochastic policy
  REINFORCE → Q Actor-Critic: one-step (temporal difference) updates
  Q Actor-Critic → A2C: reduced variability in the gradients


Advantage vs action value

Using the raw action value Q(S, A) as the learning signal leads to gradients with high variance, i.e. training takes much longer to converge.

Subtracting a baseline, V(S), gives the advantage:

  A(S, A) = Q(S, A) - V(S)

which will be much quicker to converge!
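A minimal sketch of the advantage as a learning signal, including the one-step TD estimate used when Q(s,a) is not learned explicitly; γ = 0.99 is an assumed value.

def advantage(q_sa, v_s):
    """A(s, a) = Q(s, a) - V(s)."""
    return q_sa - v_s

def td_advantage(r, v_s, v_s_next, gamma=0.99, done=False):
    """One-step estimate: A ~= r + gamma * V(s') - V(s)."""
    return r + gamma * v_s_next * (not done) - v_s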


Advantage Actor-Critic (A2C)

[Diagram: common layers take the state s, then split into a policy net that outputs 𝛑(a|s) for each action a and a value net that outputs V(s).]

Actor

Policy-based, like REINFORCE. Can now use temporal difference learning and the baseline, via the advantage estimate A(s,a) ≈ r + γ V(s') - V(s).

Critic

Value-based; now learns the value of states instead of state-action pairs.


Part 3

Current state of reinforcement learning


Current state of reinforcement learning

Mostly used in academia or at research-focused companies, e.g. DeepMind, OpenAI.

  • The most impressive progress has been made in games.

Barriers to entry:

  • Too much real-world experience required
  • Simulation is often not realistic enough
  • Poor convergence properties
  • Not enough development in transfer learning for RL models
    • Models do not generalize well outside of what they were trained on

Driverless cars, robotics, etc. still largely do not use RL.

  • "The rule-of-thumb is that except in rare cases, domain-specific algorithms work faster and better than reinforcement learning."¹

¹ Irpan, Alex. "Deep Reinforcement Learning Doesn't Work Yet." Sorta Insightful, 14 Feb. 2018.


Promising applications of RL (that aren’t games)

  • Energy
  • Finance
  • Healthcare
  • Some aspects of robotics
  • NLP
  • Computer systems
  • Traffic light control
  • Assisting GANs
  • Neural network architecture
  • Computer vision
  • Education
  • Recommendation systems
  • Science & Math


References

Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.

Fridman, Lex (2015). MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M

Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M

Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.

Lapan, M. (2018). Deep reinforcement learning hands-on: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more. Birmingham, UK: Packt Publishing.

Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods

Towards Data Science. “Applications of Reinforcement Learning in Real World”, 1 Aug 2018.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge, MA: The MIT Press.