
Survey of Modern Reinforcement Learning

Julia Maddalena


What to expect from this talk

Part 1: Introduce the foundations of reinforcement learning

  • Definitions and basic ideas
  • A couple algorithms that work in simple environments

Part 2: Review some state-of-the-art methods

  • Higher-level concepts, vanilla methods
  • Not a complete list of cutting-edge methods

Part 3: Current state of reinforcement learning


Part 1

Foundations of reinforcement learning


What is reinforcement learning?

A type of machine learning where an agent interacts with an environment and learns to take actions that result in greater cumulative reward.


Definitions

Reward

Motivation for the agent. It is not always obvious what the reward signal should be, e.g. "YOU WIN!" +1, "GAME OVER" -1, staying alive +1/second (sort of).

Agent

The learner and decision maker

Environment

Everything external to the agent that the agent uses to make decisions

Actions

The set of possible steps the agent can take depending on the state of the environment


The Problem with Rewards...

Designing reward functions is notoriously difficult (Clark, Jack. "Faulty Reward Functions in the Wild." OpenAI, 21 Dec. 2016).

Possible reward structure:

  • Total points
  • Time to finish
  • Finishing position

[Video stills: human player vs. resulting RL agent]

"I've taken to imagining deep RL as a demon that's deliberately misinterpreting your reward and actively searching for the laziest possible local optima."¹

- Alex Irpan

¹ Irpan, Alex. "Deep Reinforcement Learning Doesn't Work Yet." Sorta Insightful, 14 Feb. 2018.


More Definitions

Return

Long-term, discounted reward:

  G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... , where γ is the discount factor (0 ≤ γ ≤ 1)

Value

Expected return

  value of states → V(s) = E[G_t | S_t = s]
  how good is it to be in state s

  value of state-action pairs → Q(s,a) = E[G_t | S_t = s, A_t = a]
  how good is it to take action a from state s

Policy

How the agent should act from a given state: π(a|s), the probability of taking action a in state s
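To make the return concrete, here is a minimal Python sketch (not from the slides) that computes the discounted return of a finished episode; the reward list and γ = 0.9 are illustrative assumptions.

def discounted_return(rewards, gamma=0.9):
    """Return G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the end of the episode backwards
        g = r + gamma * g
    return g

print(discounted_return([+2, -1, +10], gamma=0.9))  # 2 + 0.9*(-1) + 0.81*10 = 9.2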


Markov Decision Process

Markov Process

A random process whose future behavior only depends on the current state.

[Diagram: a Markov process with states Sleepy, Energetic, and Hungry and transition probabilities (e.g. 70%, 50%, 35%, 15%) between them.]


Markov Decision Process

Markov Process + Actions + Rewards = Markov Decision Process

[Diagram: the same Sleepy / Energetic / Hungry process, now with actions (nap, beg, be good), transition probabilities for each action, and rewards (from -6 to +10) on each transition.]


To model or not to model

Model-based methods

  • Transition model (planning): we already know the dynamics of the environment; we simply need to plan our actions to optimize reward.
  • Sample model (planning and learning): we don't know the dynamics; we try to learn them by exploring the environment and use them to plan our actions to optimize reward.

Model-free methods

  • Learning only: we don't know or care about the dynamics; we just want to learn a good policy by exploring the environment.


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model
  Model-free


Bellman Equations

Value of each state under the optimal policy for Robodog: [diagram]

Bellman Equation (value of the current state in terms of the policy, the transition probabilities, the reward, the discount factor, and the value of the next state):

  V_π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ V_π(s') ]

Bellman Optimality Equation:

  V_*(s) = max_a Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ V_*(s') ]


Policy Iteration

Policy evaluation

Make the value function consistent with the current policy.

Policy improvement

Make the policy greedy with respect to the current value function.

[Diagram: starting from a policy that naps when sleepy and chooses 50% beg / 50% be good when energetic or hungry, evaluation gives the first set of values; improvement then makes the policy deterministic (greedy), and re-evaluation gives the second set.]

  state     | value (initial policy) | value (improved policy)
  sleepy    | 19.88                  | 29.66
  energetic | 20.97                  | 31.66
  hungry    | 20.63                  | 31.90

Converge to the optimal policy and the value function under the optimal policy.
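Below is a minimal policy-iteration sketch in Python, in the spirit of the Robodog example. The transition probabilities and rewards in P are illustrative assumptions, not the exact numbers from the diagram, and GAMMA = 0.9 is likewise assumed.

import random

GAMMA = 0.9
# P[state][action] = list of (probability, next_state, reward) -- assumed numbers
P = {
    "sleepy":    {"nap":     [(0.7, "energetic", +2), (0.3, "sleepy", -1)]},
    "energetic": {"beg":     [(0.6, "hungry", -2), (0.4, "energetic", -1)],
                  "be good": [(0.5, "hungry", +10), (0.5, "energetic", +7)]},
    "hungry":    {"beg":     [(0.6, "sleepy", +10), (0.4, "hungry", +5)],
                  "be good": [(0.6, "sleepy", -6), (0.4, "hungry", -4)]},
}

def evaluate(policy, V, sweeps=100):
    """Policy evaluation: make V consistent with the current (deterministic) policy."""
    for _ in range(sweeps):
        for s in P:
            a = policy[s]
            V[s] = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
    return V

def improve(V):
    """Policy improvement: act greedily with respect to the current value function."""
    return {s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
            for s in P}

V = {s: 0.0 for s in P}
policy = {s: random.choice(list(P[s])) for s in P}
for _ in range(10):                      # alternate evaluation and improvement
    V = evaluate(policy, V)
    policy = improve(V)
print(policy, {s: round(v, 2) for s, v in V.items()})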


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning (off-policy)
    Policy-based
  Update style: Monte Carlo vs. Temporal Difference


When learning happens

Monte Carlo: wait until end of episode before making updates to value estimates

[Diagram: a full game of tic-tac-toe is played out; only once the episode ends is the value updated for all states visited in the episode.]

Temporal difference, TD(0): update every step using estimates of the next state's value (bootstrapping).

[Diagram: after every move, the value of the previous state is updated immediately.]

In this example, learning = updating the value of states.


Exploration vs exploitation

  • A greedy policy will get stuck in local optima early on.
  • We need to explore to be able to learn, but we need to exploit in order to gather the most reward.
  • With value-based methods, we need to derive the policy from the value estimates.

𝜀-greedy policy: with probability 𝜀 take a random action; otherwise take the action with the highest estimated value.
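A minimal sketch of an 𝜀-greedy policy derived from a Q table; Q is assumed to be a dict mapping (state, action) pairs to estimated values.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit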


Sarsa

Initialize Q(s,a).

For each episode:
  • Start in a random state, S.
  • Choose action A from S using the 𝛆-greedy policy derived from Q(s,a).
  • While S is not terminal:
      1. Take action A, observe reward R and new state S'.
      2. Choose action A' from S' using the 𝛆-greedy policy derived from Q(s,a).
      3. Update Q for state S and action A:
         Q(S, A) ← Q(S, A) + α [ R + γ Q(S', A') - Q(S, A) ]
      4. S ← S', A ← A'

Example transition (S, A, R, S', A') = (Energetic, be good, +5, Hungry, beg), starting from a Q table initialized to zero:

  S         | A       | Q(S, A)
  sleepy    | nap     | 0
  energetic | beg     | 0
  energetic | be good | 0
  hungry    | beg     | 0
  hungry    | be good | 0
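A sketch of tabular Sarsa following the steps above, assuming an environment object with reset() returning a state and step(action) returning (next_state, reward, done); that interface and the hyperparameter values are assumptions for illustration.

import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)]

    def policy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)
            # on-policy TD update: bootstrap from the action actually chosen next
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)])
            s, a = s2, a2
    return Q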


Q-Learning

Initialize Q(s,a).

For each episode:
  • Start in a random state, S.
  • While S is not terminal:
      1. Choose action A from S using the 𝛆-greedy policy derived from Q(s,a).
      2. Take action A, observe reward R and new state S'.
      3. Update Q for state S and action A:
         Q(S, A) ← Q(S, A) + α [ R + γ max_a' Q(S', a') - Q(S, A) ]
      4. S ← S'

Example transition (S, A, R, S') = (Energetic, be good, +5, Hungry), starting from a Q table initialized to zero. The update bootstraps from the best action available in S' (e.g. if Q(Hungry, be good) = -1 and Q(Hungry, beg) = 2, it uses 2), regardless of which action (here, beg) is actually taken next.

  S         | A       | Q(S, A)
  sleepy    | nap     | 0
  energetic | beg     | 0
  energetic | be good | 0
  hungry    | beg     | 0
  hungry    | be good | 0
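The same sketch adapted to Q-learning, with the same assumed environment interface as the Sarsa sketch above; the only change is that the update bootstraps from the best next action rather than the action actually chosen next.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, x)] for x in actions)
            # off-policy TD update: bootstrap from the best next action, not the one taken
            Q[(s, a)] += alpha * (r + gamma * best_next * (not done) - Q[(s, a)])
            s = s2
    return Q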


Part 2

State-of-the-art methods


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model: Dyna-Q, Monte Carlo Tree Search
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning (off-policy)
    Policy-based
  Update style: Monte Carlo vs. Temporal Difference


Dyna-Q

Initialize Q(s,a) and Model(s,a).

For each episode:
  • Start in a random state, S.
  • While S is not terminal:
      1. Choose action A from S using the 𝛆-greedy policy derived from Q(s,a).
      2. Take action A, observe reward R and new state S'.
      3. Update Q for state S and action A (ordinary Q-learning):
         Q(S, A) ← Q(S, A) + α [ R + γ max_a' Q(S', a') - Q(S, A) ]
      4. Update Model for state S and action A:
         Model(S, A) ← (R, S')
      5. "Hallucinate" n transitions from the Model and use them to update Q with the same rule.

Example: both tables start empty,

  S         | A       | Q(S, A) | Model: R | Model: S'
  sleepy    | nap     | 0       | 0        | NA
  energetic | beg     | 0       | 0        | NA
  energetic | be good | 0       | 0        | NA
  hungry    | beg     | 0       | 0        | NA
  hungry    | be good | 0       | 0        | NA

After observing (Energetic, be good, +5, Hungry) and (Hungry, beg, -6, Sleepy), the Model stores those rewards and next states, and repeated hallucinated replays of the first transition keep pushing Q(Energetic, be good) upward (0.5 after the real update, then 0.95, 1.355, 1.7195 with α = 0.1).

Dyna-Q


Monte Carlo Tree Search

Used when brute-force expansion of the entire game tree (minimax) is not feasible:

  Tic-tac-toe: 5,478 states
  Go: ~10^170 states


Monte Carlo Tree Search

[Diagram: one MCTS iteration on a tic-tac-toe position, with win/visit counts such as 0/0, 1/1, 0/1 on the tree nodes.]

Selection: choose a starting node.

Expansion: look at unexplored child nodes.

Simulation: simulate playing out the game under the rollout policy.

Update: update stats.


Monte Carlo Tree Search

[Diagram: a later MCTS iteration; parent stats accumulate to values such as 2/2, 2/3, 3/4.]

Selection: choose a child node to explore (based on how good its stats are and how much it has been ignored).

Expansion: look at unexplored child nodes.

Simulation: simulate playing out the game under the rollout policy.

Update: update stats, including those of parent nodes.


Monte Carlo Tree Search

[Diagram: another MCTS iteration, deeper in the tree.]

Selection: choose a child node to explore (based on how good its stats are and how much it has been ignored).

Expansion: look at unexplored child nodes (only applies if the child node has not been expanded before).

Simulation: simulate playing out the game under the rollout policy.

Update: update stats, including those of parent nodes.


Monte Carlo Tree Search

[Diagram: after many iterations, the root position has stats 47/65 and its two candidate moves have stats 32/37 and 12/33.]
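The Selection rule ("how good its stats are and how much it has been ignored") is commonly implemented with UCB1/UCT; a minimal sketch, where the exploration constant c = 1.4 is an assumed value.

import math

def ucb1_select(children, c=1.4):
    """children: list of (wins, visits) per child; returns the index to explore next."""
    total_visits = sum(v for _, v in children)
    def score(stats):
        wins, visits = stats
        if visits == 0:
            return float("inf")                  # always try unvisited children first
        # exploitation (win rate) + exploration (bonus for rarely visited children)
        return wins / visits + c * math.sqrt(math.log(total_visits) / visits)
    return max(range(len(children)), key=lambda i: score(children[i]))

print(ucb1_select([(32, 37), (12, 33), (0, 0)]))  # the unvisited child is picked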


Deep Reinforcement Learning

Instead of a lookup table with one row per state-action pair, use a function approximator: a "black box" that takes the state s as input and outputs Q(s, a) for each action a.

[Diagram: a tabular Q function for states 1-8 and actions X, Y, Z (24 entries, Q(1, X) through Q(8, Z)) next to a neural network that maps state s to outputs Q(s, X), Q(s, Y), Q(s, Z).]
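A minimal numpy sketch of that "black box": a small network mapping a state vector to one Q-value per action instead of storing a table row per (state, action) pair. The layer sizes, one-hot state encoding, and random weights are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 8, 16, 3          # e.g. states 1-8, actions X, Y, Z
W1 = rng.normal(scale=0.1, size=(n_state, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_actions))

def q_values(state_vec):
    """Forward pass: state -> [Q(s, X), Q(s, Y), Q(s, Z)]."""
    h = np.maximum(0.0, state_vec @ W1)          # ReLU hidden layer
    return h @ W2

s = np.zeros(n_state); s[0] = 1.0                # one-hot encoding of state 1
print(q_values(s))                               # one Q estimate per action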


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model: Dyna-Q, Monte Carlo Tree Search
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning, Deep Q Networks* (off-policy)
    Policy-based
  Update style: Monte Carlo Methods vs. Temporal Difference Methods

* Utilize deep learning


Deep Q Networks (DQN)

The "black box" network takes the state s (e.g. the board encoded as X / blank / O per square) and outputs Q(s, a) for each action a: how good is it to take this action from this state?

  1. Initialize the network.
  2. Take one action under the Q policy.
  3. Add the new transition to the training data:

       step | s  | a  | r  | s'
       1    | s1 | a1 | r1 | s2
       2    | s2 | a2 | r2 | s3
       ...  | ...| ...| ...| ...
       t    | st | at | rt | st+1

  4. Use stochastic gradient descent to update the weights based on the error between the prediction ŷ = Q(s, a) and the target y = r + γ max_a' Q(s', a').

Repeat steps 2-4 until convergence.


Deep Q Networks (DQN)

  1. Initialize the network.
  2. Take one action under the Q policy.
  3. Add the new transition to the training data.
  4. Use stochastic gradient descent to update the weights based on the error between ŷ = Q(s, a) and y = r + γ max_a' Q(s', a').

Problem:
  • Data are not i.i.d.
  • Data are collected under an evolving policy, not the optimal policy that we are trying to learn.

Solution: create a replay buffer of the last k transitions and train on small random samples from it.

Problem: instability is introduced when updating Q(s, a) using Q(s', a') from the same network.

Solution: have a secondary target network that is used to evaluate Q(s', a'), and only sync it with the primary network after every n training iterations.

[Diagram: the primary network trains on samples from the replay buffer (transitions t-k through t); the target network is a periodically synced copy.]
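A sketch of the two fixes above: a bounded replay buffer sampled at random, and a target network synced only every n training iterations. The get_weights()/set_weights() methods on the network objects are an assumed interface, not a specific library's API.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions fall off the end

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def maybe_sync_target(primary_net, target_net, step, sync_every=1000):
    """Copy the primary network's weights into the target network every sync_every steps."""
    if step % sync_every == 0:
        target_net.set_weights(primary_net.get_weights())  # assumed network interface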


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model: Dyna-Q, Monte Carlo Tree Search
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning, Deep Q Networks* (off-policy)
    Policy-based: REINFORCE*
  Update style: Monte Carlo Methods vs. Temporal Difference Methods

* Utilize deep learning


REINFORCE

The "black box" network takes the state s and outputs 𝛑(a|s) for each action a: what is the probability of taking this action under policy 𝛑?

  1. Initialize the network.
  2. Play out a full episode under 𝛑, collecting rewards r1, r2, ..., rT.
  3. For every step t, calculate the return from that state until the end of the episode:
       G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
  4. Use stochastic gradient descent to update the weights based on the policy-gradient loss, -G_t log 𝛑(a_t|s_t).

Repeat steps 2-4 until convergence.
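A sketch of the REINFORCE targets: the return-to-go for every step of an episode and the policy-gradient loss -G_t log 𝛑(a_t|s_t) that the weights would be updated on (the gradient step itself is left to whatever network library is used); γ = 0.99 is an assumed value.

def returns_to_go(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... for every step t of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs[t] = log pi(a_t | s_t) under the current policy network."""
    Gs = returns_to_go(rewards, gamma)
    return -sum(G * lp for G, lp in zip(Gs, log_probs))

print(returns_to_go([0, 0, 1], gamma=0.5))       # [0.25, 0.5, 1.0]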


DQN vs REINFORCE

                | DQN                    | REINFORCE
  Learning      | Off-policy             | On-policy
  Updates       | Temporal difference    | Monte Carlo
  Output        | Q(s,a) → value-based   | 𝛑(a|s) → policy-based
  Action spaces | Small, discrete only   | Large discrete or continuous
  Exploration   | 𝛆-greedy               | Built in, thanks to the stochastic policy
  Convergence   | Slower to converge     | Faster to converge
  Experience    | Less experience needed | More experience needed


Reinforcement Learning Methods

[Taxonomy diagram: methods covered so far]

  Model-based
    Transition Model: Dynamic Programming
    Sample Model: Dyna-Q, Monte Carlo Tree Search
  Model-free
    Value-based: Sarsa (on-policy); Q-Learning, Deep Q Networks* (off-policy)
    Policy-based: REINFORCE*, Advantage Actor-Critic*
  Update style: Monte Carlo Methods vs. Temporal Difference Methods

* Utilize deep learning


Q Actor-Critic

[Diagram: one network with common layers takes the state s (the board encoded as X / blank / O), then splits into a policy net that outputs 𝛑(a|s) for each action a and a value net that outputs Q(s,a) for each action a.]

Actor

Policy-based, like REINFORCE, but can now use temporal difference learning.

Critic

Value-based; works somewhat like DQN.


Quick review

  Q-Learning → DQN: ability to generalize values across the state space
  DQN → REINFORCE: ability to control in continuous action spaces using a stochastic policy
  REINFORCE → Q Actor-Critic: one-step (temporal difference) updates
  Q Actor-Critic → A2C: reduced variability in the gradients


Advantage vs action value

Using the raw action value Q(S, A) as the learning signal leads to gradients with high variance, i.e. training takes much longer to converge.

Subtracting a baseline, V(S), gives the advantage:

  A(S, A) = Q(S, A) - V(S)

which will be much quicker to converge!
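A minimal sketch of the advantage as a learning signal, including the one-step TD estimate used when Q(s,a) is not learned explicitly; γ = 0.99 is an assumed value.

def advantage(q_sa, v_s):
    """A(s, a) = Q(s, a) - V(s)."""
    return q_sa - v_s

def td_advantage(r, v_s, v_s_next, gamma=0.99, done=False):
    """One-step estimate: A ~= r + gamma * V(s') - V(s)."""
    return r + gamma * v_s_next * (not done) - v_s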


Advantage Actor-Critic (A2C)

[Diagram: common layers take the state s, then split into a policy net that outputs 𝛑(a|s) for each action a and a value net that outputs V(s).]

Actor

Policy-based, like REINFORCE. Can now use temporal difference learning and the baseline, via the advantage estimate A(s,a) ≈ r + γ V(s') - V(s).

Critic

Value-based; now learns the value of states instead of state-action pairs.


Part 3

Current state of reinforcement learning


Current state of reinforcement learning

Mostly used in academia or at research-focused companies, e.g. DeepMind, OpenAI.

  • The most impressive progress has been made in games.

Barriers to entry:

  • Too much real-world experience required
  • Simulation is often not realistic enough
  • Poor convergence properties
  • Not enough development in transfer learning for RL models
    • Models do not generalize well outside of what they were trained on

Driverless cars, robotics, etc. still largely do not use RL.

  • "The rule-of-thumb is that except in rare cases, domain-specific algorithms work faster and better than reinforcement learning."¹

¹ Irpan, Alex. "Deep Reinforcement Learning Doesn't Work Yet." Sorta Insightful, 14 Feb. 2018.


Promising applications of RL (that aren’t games)

  • Energy
  • Finance
  • Healthcare
  • Some aspects of robotics
  • NLP
  • Computer systems
  • Traffic light control
  • Assisting GANs
  • Neural network architecture
  • Computer vision
  • Education
  • Recommendation systems
  • Science & Math


References

Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.

Fridman, Lex (2015). MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M

Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M

Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.

Lapan, M. (2018). Deep reinforcement learning hands-on: Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more. Birmingham, UK: Packt Publishing.

Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods

Towards Data Science. “Applications of Reinforcement Learning in Real World”, 1 Aug 2018.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge, MA: The MIT Press.