Survey of Modern
Reinforcement Learning
Julia Maddalena
What to expect from this talk
Part 1: Introduce the foundations of reinforcement learning
Part 2: Review some state-of-the-art methods
Part 3: Current state of reinforcement learning
Part 1
Foundations of reinforcement learning
What is reinforcement learning?
A type of machine learning where an agent interacts with an environment and learns to take actions that result in greater cumulative reward.
Definitions
Reward
Motivation for the agent. Not always obvious what the reward signal should be
Examples:
YOU WIN! → +1
GAME OVER → −1
Stay alive → +1/second (sort of)
Agent
The learner and decision maker
Environment
Everything external to the agent that it uses to make decisions
Actions
The set of possible steps the agent can take depending on the state of the environment
The Problem with Rewards...
Designing reward functions is notoriously difficult
Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.
Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
[Video: a possible reward structure, a human player, and the resulting RL agent, from “Faulty Reward Functions in the Wild”]
“I’ve taken to imagining deep RL as a demon that’s deliberately misinterpreting your reward and actively searching for the laziest possible local optima.”
- Alex Irpan
More Definitions
Return
Long-term, discounted reward:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …
Value
Expected return
value of states → V(s)
how good is it to be in state s
value of state-action pairs → Q(s,a)
how good is it to take action a from state s
discount factor → γ, how much future reward is worth relative to immediate reward
Policy
How the agent should act from a given state
→ π(a|s)
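To make the return concrete, here is a minimal Python sketch (not from the talk) that computes a discounted return from a list of rewards; the reward values are illustrative:

```python
# Discounted return: G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # accumulate from the end of the episode backwards
        g = r + gamma * g
    return g

# Illustrative rewards from one episode
print(discounted_return([-1, -1, +10], gamma=0.9))   # -1 + 0.9*(-1) + 0.81*10 = 6.2
```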
Markov Decision Process
Markov Process
A random process whose future behavior only depends on the current state.
[Diagram: a dog's states (Sleepy, Energetic, Hungry) with transition probabilities between them, e.g. 70%, 50%, 15%]
Markov Decision Process
Markov Process (states: Sleepy, Energetic, Hungry)
+ Actions (nap, beg, be good)
= Markov Decision Process
[Diagram: each action has its own transition probabilities, and each transition carries a reward, e.g. +10, +7, +5, +2, −1, −2, −4, −6]
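One way to see what an MDP is made of is to write it down as plain data. This is a hypothetical sketch: the transition probabilities and rewards below are illustrative stand-ins, not the diagram's exact numbers.

```python
# MDP as plain data: for each (state, action) pair, a list of
# (probability, next_state, reward) outcomes.
# Probabilities and rewards are illustrative, not the diagram's exact values.
mdp = {
    ("sleepy",    "nap"):     [(0.7, "energetic", +2), (0.3, "sleepy",  -1)],
    ("energetic", "beg"):     [(0.6, "hungry",    +7), (0.4, "sleepy",  -1)],
    ("energetic", "be good"): [(0.5, "hungry",    +5), (0.5, "sleepy",  +2)],
    ("hungry",    "beg"):     [(0.6, "sleepy",   +10), (0.4, "hungry",  -2)],
    ("hungry",    "be good"): [(0.5, "sleepy",   +10), (0.5, "hungry",  -4)],
}

states = {"sleepy", "energetic", "hungry"}
actions = {s: sorted({a for (st, a) in mdp if st == s}) for s in states}
```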
To model or not to model
Model-based methods
Learn a model of the environment's dynamics (a transition model or a sample model) and use it to plan our actions to optimize reward → planning
Model-free methods
We don't know or care about the dynamics, we just want to learn a good policy by exploring the environment → learning
Some methods combine planning and learning.
Reinforcement Learning Methods
• Model-based
  - Transition Model: Dynamic Programming
  - Sample Model
• Model-free
Bellman Equations
Value of each state under the optimal policy for Robodog:

Bellman Equation:
V_π(s) = Σ_a π(a|s) Σ_s′ p(s′|s, a) [ r + γ V_π(s′) ]

Bellman Optimality Equation:
V*(s) = max_a Σ_s′ p(s′|s, a) [ r + γ V*(s′) ]

where π(a|s) is the policy, p(s′|s, a) the transition probabilities, r the reward, γ the discount factor, V(s′) the value of the next state, and V(s) the value of the current state.
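As a sketch of what the Bellman equation computes, here is one backup for a single state in Python; the policy, dynamics, and numbers are illustrative assumptions, not values from the slides:

```python
# V(s) = sum over actions of pi(a|s) * sum over outcomes of p(s'|s,a) * (r + gamma*V(s'))
def bellman_backup(s, policy, V, mdp, gamma=0.9):
    return sum(
        prob_a * sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
        for a, prob_a in policy[s].items()
    )

V = {"sleepy": 0.0, "energetic": 0.0, "hungry": 0.0}          # current value estimates
policy = {"energetic": {"beg": 0.5, "be good": 0.5}}          # 50/50 policy in one state
mdp = {                                                        # illustrative dynamics
    ("energetic", "beg"):     [(0.6, "hungry", +7), (0.4, "sleepy", -1)],
    ("energetic", "be good"): [(0.5, "hungry", +5), (0.5, "sleepy", +2)],
}
print(bellman_backup("energetic", policy, V, mdp))             # 3.65 with these numbers
```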
Policy Iteration
Policy evaluation
Makes the value function consistent with the current policy
Policy improvement
Makes the policy greedy with respect to the current value function
[Diagram: the initial policy (nap when sleepy; beg / be good with 50% each when energetic or hungry) vs. the greedy policy after improvement (a single action chosen 100% of the time in each state)]

Values under the initial policy:
state | value
sleepy | 19.88
energetic | 20.97
hungry | 20.63

Values under the improved policy:
state | value
sleepy | 29.66
energetic | 31.66
hungry | 31.90

Alternating the two steps converges to the optimal policy and the value function under that optimal policy.
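Putting the two steps together, here is a minimal policy iteration sketch in Python. It assumes an `mdp` dictionary shaped like the earlier sketch (lists of (probability, next_state, reward) per (state, action)); it is an illustration, not the talk's code.

```python
def policy_iteration(mdp, states, gamma=0.9, tol=1e-6):
    """Alternate policy evaluation and greedy policy improvement."""
    actions = {s: [a for (st, a) in mdp if st == s] for s in states}
    policy = {s: actions[s][0] for s in states}          # arbitrary starting policy
    V = {s: 0.0 for s in states}

    def q(s, a):                                         # expected return of (s, a) under V
        return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])

    while True:
        # Policy evaluation: make V consistent with the current policy
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: make the policy greedy with respect to V
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: q(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                       # no change -> optimal policy
            return policy, V

# e.g. policy, V = policy_iteration(mdp, {"sleepy", "energetic", "hungry"})
```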
Reinforcement Learning Methods
• Model-based
  - Transition Model: Dynamic Programming
  - Sample Model
• Model-free
  - Value-based: Sarsa (on-policy), Q-Learning (off-policy)
  - Policy-based
• Monte Carlo vs. Temporal Difference (when updates happen)
When learning happens
Monte Carlo: wait until end of episode before making updates to value estimates
[Diagram: a game of tic-tac-toe played move by move to its conclusion]
Update value for all states in episode (only after the episode ends).

Temporal difference, TD(0): update every step using estimates of next states (bootstrapping).
[Diagram: after each move, update the value of the previous state using the estimated value of the new state, and so on]

In this example, learning = updating the value of states.
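A minimal sketch of the two update styles for state values; the step size `alpha` and the episode format ((state, reward) pairs) are assumptions for illustration:

```python
def mc_update(V, episode, alpha=0.1, gamma=0.9):
    """Monte Carlo: wait until the episode ends, then move every visited state's
    value toward the return actually observed from that state onward.
    `episode` is a list of (state, reward_received_after_leaving_state) pairs."""
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g                 # return from this state to the end
        V[state] += alpha * (g - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """TD(0): update immediately after one step, bootstrapping from the current
    estimate of the next state's value."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
```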
Exploration vs exploitation
𝜀-greedy policy: with probability 𝜀 take a random action (explore); otherwise take the action with the highest estimated value (exploit).
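In code, an 𝜀-greedy choice is just a coin flip; a minimal sketch, assuming Q is a dict keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit
```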
Sarsa
S | A | Q(S, A) |
sleepy | nap | 0 |
energetic | beg | 0 |
energetic | be good | 0 |
hungry | beg | 0 |
hungry | be good | 0 |
Example transition: S = Energetic, A = be good, R = +5, S’ = Hungry, A’ = beg
Initialize Q(s,a)
For each episode:
• Start in a random state, S.
• Choose action A from S using 𝛆-greedy policy from Q(s,a).
• While S is not terminal:
1. Take action A, observe reward R and new state S’.
2. Choose action A’ from S’ using 𝛆-greedy policy from Q(s,a).
3. Update Q for state S and action A: Q(S, A) ← Q(S, A) + α [ R + γ Q(S’, A’) − Q(S, A) ]
4. S ← S’, A ← A’
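A minimal Sarsa sketch in Python. The environment interface (`reset()`, `actions(state)`, `step(state, action) -> (reward, next_state, done)`) is an assumption for illustration, not an API from the talk:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """On-policy TD control: the update uses the action A' actually chosen next."""
    Q = defaultdict(float)

    def choose(s):                                   # epsilon-greedy behavior policy
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            r, s2, done = env.step(s, a)
            a2 = choose(s2) if not done else None
            target = r + (gamma * Q[(s2, a2)] if not done else 0.0)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q(S,A) += α[R + γQ(S',A') − Q(S,A)]
            s, a = s2, a2
    return Q
```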
Q-Learning
S | A | Q(S, A) |
sleepy | nap | 0 |
energetic | beg | 0 |
energetic | be good | 0 |
hungry | beg | 0 |
hungry | be good | 0 |

Example transition: S = Energetic, A = be good, R = +5, S’ = Hungry; candidate next actions from S’: be good (Q = −1), beg (Q = 2).

Initialize Q(s,a)
For each episode:
• Start in a random state, S.
• While S is not terminal:
1. Choose action A from S using 𝛆-greedy policy from Q(s,a).
2. Take action A, observe reward R and new state S’.
3. Update Q for state S and action A: Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S’, a) − Q(S, A) ]
4. S ← S’

Unlike Sarsa, the update uses the best action from S’ rather than the action actually taken next, which is what makes Q-Learning off-policy.
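The same loop with the Q-Learning update, for contrast; the environment interface is the same assumed one as in the Sarsa sketch:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy TD control: the update bootstraps from the max over next actions,
    no matter which action the epsilon-greedy behavior takes next."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda x: Q[(s, x)])
            r, s2, done = env.step(s, a)
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```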
Part 2
State-of-the-art methods
Reinforcement Learning Methods
• Model-based
  - Transition Model: Dynamic Programming
  - Sample Model: Dyna-Q, Monte Carlo Tree Search
• Model-free
  - Value-based: Sarsa (on-policy), Q-Learning (off-policy)
  - Policy-based
• Monte Carlo vs. Temporal Difference (when updates happen)
Dyna-Q
Initialize Q(s,a) and Model(s,a)
For each episode:
• Start in a random state, S.
• While S is not terminal:
1-3. Ordinary Q-Learning: choose A (𝛆-greedy), take it, observe R and S’, update Q.
4. Update Model for state S and action A: Model(S, A) ← R, S’
5. “Hallucinate” n transitions from previously seen (S, A) pairs and use them to update Q.

S | A | Q(S, A) | Model(S, A): R | Model(S, A): S’ |
sleepy | nap | 0 | 0 | NA |
energetic | beg | 0 | 0 | NA |
energetic | be good | 0 | 0 | NA |
hungry | beg | 0 | 0 | NA |
hungry | be good | 0 | 0 | NA |

[Diagram: real transitions (e.g. Energetic, be good → +5, Hungry; Hungry, beg → −6, Sleepy) fill in the model, and repeated hallucinated updates propagate value through Q, e.g. 0.95 → 1.355 → 1.7195]
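A sketch of one Dyna-Q step in Python: the real update, the model update, and n hallucinated planning updates. The deterministic model, the `actions(state)` helper, and Q as a dict defaulting to 0 are assumptions for illustration:

```python
import random

def dyna_q_step(Q, model, seen, s, a, r, s2, actions, n=10, alpha=0.1, gamma=0.9):
    """One Dyna-Q step: real Q-Learning update + model update + n planning updates."""
    # 1-3. ordinary Q-Learning update from the real transition
    best_next = max(Q[(s2, a2)] for a2 in actions(s2))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    # 4. update the model: remember the last observed outcome of (s, a)
    model[(s, a)] = (r, s2)
    seen.add((s, a))
    # 5. planning: "hallucinate" n transitions replayed from the learned model
    for _ in range(n):
        ps, pa = random.choice(list(seen))
        pr, ps2 = model[(ps, pa)]
        best = max(Q[(ps2, a2)] for a2 in actions(ps2))
        Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
```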
Monte Carlo Tree Search
When brute-force expansion of the entire game tree (Minimax) is not feasible:
Tic-tac-toe: 5,478 states
Go: ~10^170 states
Monte Carlo Tree Search
Repeat four phases, growing a game tree whose nodes track wins/visits statistics:

Selection
Choose a child node to explore, based on how good its stats are and how much it has been ignored.

Expansion
Look at unexplored child nodes (only applies if the child node has not been expanded before).

Simulation
Simulate playing out the game from there under a rollout policy.

Update
Update the stats of the simulated node, including those of its parent nodes.

[Diagram: successive MCTS iterations on a tic-tac-toe tree; nodes accumulate wins/visits counts (0/1, 1/1, 2/3, 3/4, …) until, after many simulations, the root (47/65) prefers the child with the best record (32/37 over 12/33)]
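A sketch of the bookkeeping behind those stats. The slides only say that selection balances "how good its stats are" against "how much it has been ignored"; the UCB1 formula below is the commonly used way to do that, shown here as an assumption rather than the talk's exact rule:

```python
import math

class Node:
    """One tree node, tracking wins/visits like the 1/1 and 0/1 stats above."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []
        self.wins = 0
        self.visits = 0

def ucb1(node, c=1.4):
    """Selection score for a non-root node: exploit a good win rate,
    but boost nodes that have been visited rarely relative to their parent."""
    if node.visits == 0:
        return float("inf")                      # always try unvisited children first
    exploit = node.wins / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def backup(node, won):
    """Update: push the simulation result up through all parent nodes."""
    while node is not None:
        node.visits += 1
        node.wins += 1 if won else 0
        node = node.parent
```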
Deep Reinforcement Learning
Instead of a lookup table with one Q(s, a) entry for every state-action pair (e.g. states 1-8 × actions X, Y, Z), feed the state s into a neural network “black box” that outputs Q(s, X), Q(s, Y), Q(s, Z) in a single pass.

[Diagram: tabular Q(s, a) vs. a neural network mapping state s to a Q-value for each action]
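A minimal sketch of that "black box" in PyTorch; the layer sizes and the one-hot state encoding are illustrative assumptions:

```python
import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS = 8, 3      # e.g. states 1-8 one-hot encoded, actions X, Y, Z

# One small network replaces the whole Q(s, a) lookup table
q_net = nn.Sequential(
    nn.Linear(N_FEATURES, 32),
    nn.ReLU(),
    nn.Linear(32, N_ACTIONS),     # outputs Q(s, X), Q(s, Y), Q(s, Z)
)

s = torch.zeros(1, N_FEATURES)
s[0, 0] = 1.0                     # one-hot encoding of "state 1"
print(q_net(s))                   # all three Q-values in a single forward pass
```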
Reinforcement Learning Methods
• Model-based
  - Transition Model: Dynamic Programming
  - Sample Model: Dyna-Q, Monte Carlo Tree Search
• Model-free
  - Value-based: Sarsa (on-policy), Q-Learning (off-policy), Deep Q Networks*
  - Policy-based
• Monte Carlo Methods vs. Temporal Difference Methods (when updates happen)
* Utilize deep learning
Deep Q Networks (DQN)
[Diagram: the board state (each cell encoded as X / blank / O) is fed into a neural network “black box” that outputs Q(s, a) for each action a: how good is it to take this action from this state?]

| s | a | r | s’ |
1 | s1 | a1 | r1 | s2 |
2 | s2 | a2 | r2 | s3 |
| ... | ... | ... | ... |
t | st | at | rt | st+1 |

3. Add new information to training data: (st, at, rt, st+1)
4. Use stochastic gradient descent to update weights based on the error between the prediction ŷ = Q(s, a) and the target y = r + γ max_a’ Q(s’, a’)
Repeat steps 2-4 until convergence
Deep Q Networks (DQN)
Problem:
Consecutive experiences are highly correlated, so training on them in order is unstable.
Solution:
Create a replay buffer of size k to take small samples from.

Problem:
Instability introduced when updating Q(s, a) using Q(s’, a’) from the same network.
Solution:
Have a secondary target network used to evaluate Q(s’, a’) and only sync it with the primary network every n training iterations.

[Diagram: primary network vs. target network; the replay buffer keeps only the most recent k experiences (t − k, …, t) out of the full history]
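A minimal sketch of a DQN training step with both fixes, in PyTorch. The network sizes, the buffer size, and the (state, action, reward, next_state, done) tensor format of replay entries are illustrative assumptions:

```python
import random
from collections import deque

import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS, GAMMA = 9, 9, 0.99        # sizes are illustrative

def make_net():
    return nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

primary_net, target_net = make_net(), make_net()
target_net.load_state_dict(primary_net.state_dict())     # start the two networks in sync
optimizer = torch.optim.Adam(primary_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                             # replay buffer of size k

def train_step(batch_size=32):
    """One SGD update from a random minibatch of past experience."""
    if len(replay) < batch_size:
        return
    batch = random.sample(list(replay), batch_size)       # break up correlated sequences
    s, a, r, s2, done = map(torch.stack, zip(*batch))
    y_hat = primary_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # prediction Q(s, a)
    with torch.no_grad():                                          # target from the frozen net
        y = r + GAMMA * target_net(s2).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(y_hat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# every n training iterations, sync the target network with the primary one:
# target_net.load_state_dict(primary_net.state_dict())
```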
Reinforcement Learning Methods
• Model-based
  - Transition Model: Dynamic Programming
  - Sample Model: Dyna-Q, Monte Carlo Tree Search
• Model-free
  - Value-based: Sarsa (on-policy), Q-Learning (off-policy), Deep Q Networks*
  - Policy-based: REINFORCE*
• Monte Carlo Methods vs. Temporal Difference Methods (when updates happen)
* Utilize deep learning
REINFORCE
[Diagram: the board state (each cell encoded as X / blank / O) is fed into a neural network that outputs 𝛑(a|s) for each action a: what is the probability of taking this action under policy 𝛑?]
2. Play out a full episode under 𝛑, observing rewards r1, r2, …, r8.
3. For every step t, calculate the return Gt from that state until the end: the discounted sum of the remaining rewards.
4. Use stochastic gradient descent to update weights based on the policy gradient, which increases log 𝛑(at|st) in proportion to Gt.
Repeat steps 2-4 until convergence
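A minimal REINFORCE sketch in PyTorch. The Gym-style environment calls (`reset()`, `step(action) -> (state, reward, done)`) and the network sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS, GAMMA = 9, 9, 0.99

policy = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(),
                       nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_episode(env):
    """Steps 2-4: play one episode under pi, compute returns, take a gradient step."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:                                           # 2. play out a full episode
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())         # assumed env interface
        rewards.append(reward)

    returns, g = [], 0.0                                      # 3. return from each step
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns)

    loss = -(torch.stack(log_probs) * returns).sum()          # 4. ascend the policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```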
DQN vs REINFORCE
| DQN | REINFORCE |
Learning | Off-policy | On-policy |
Updates | Temporal difference | Monte Carlo |
Output | Q(s,a) ➝ Value-based | 𝛑(a|s) ➝ Policy-based |
Action spaces | Small discrete only | Large discrete or continuous |
Exploration | 𝛆-greedy | Built-in due to stochastic policy |
Convergence | Slower to converge | Faster to converge |
Experience | Less experience needed | More experience needed |
Reinforcement Learning Methods
• Model-based
  - Transition Model: Dynamic Programming
  - Sample Model: Dyna-Q, Monte Carlo Tree Search
• Model-free
  - Value-based: Sarsa (on-policy), Q-Learning (off-policy), Deep Q Networks*
  - Policy-based: REINFORCE*
  - Actor-Critic (value- and policy-based): Q Actor-Critic, Advantage Actor-Critic*
• Monte Carlo Methods vs. Temporal Difference Methods (when updates happen)
* Utilize deep learning
Q Actor-Critic
[Diagram: the state (each cell encoded as X / blank / O) feeds shared common layers, which branch into a policy net outputting 𝛑(a|s) for each action a and a value net outputting Q(s, a) for each action a]
Actor
Policy-based like REINFORCE but can now use temporal difference learning
Critic
Value-based, works sort of like DQN
Quick review
Q-Learning
→ DQN adds the ability to generalize values in state space
→ REINFORCE adds the ability to control in continuous action spaces using a stochastic policy
→ Q Actor-Critic adds one-step updates
→ A2C reduces variability in gradients
Advantage vs action value
Using Q(S, A) directly in the policy gradient will lead to gradients with high variance, i.e. it will take much longer to converge.

Baseline = V(s):
A(S, A) = Q(S, A) − V(S) ← the advantage

Using the advantage instead will be much quicker to converge!
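A tiny numeric illustration of the advantage; the Q and V numbers are made up for this example, not taken from the slides:

```python
# A(s, a) = Q(s, a) - V(s): how much better an action is than the state's baseline
Q = {("hungry", "beg"): 31.9, ("hungry", "be good"): 28.4}   # illustrative values
V = {"hungry": 30.1}                                         # baseline value of the state

A = {(s, a): q - V[s] for (s, a), q in Q.items()}
print(A)   # beg is ~+1.8 better than the baseline, be good ~-1.7 worse
```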
Advantage Actor-Critic (A2C)
[Diagram: the state (each cell encoded as X / blank / O) feeds shared common layers, which branch into a policy net outputting 𝛑(a|s) for each action a and a value net outputting V(s)]
Actor
Policy-based like REINFORCE
Can now use temporal difference learning and a baseline: the advantage A(s, a) ≈ r + γ V(s’) − V(s)
Critic
Value-based; now learns the value of states, V(s), instead of state-action pairs
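A minimal sketch of one A2C update in PyTorch, with shared layers, a policy head, and a value head. The layer sizes, the single-transition update, and the (s, a, r, s’, done) inputs are illustrative assumptions:

```python
import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS, GAMMA = 9, 9, 0.99

shared = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU())     # common layers
policy_head = nn.Linear(64, N_ACTIONS)                            # actor: logits for pi(a|s)
value_head = nn.Linear(64, 1)                                     # critic: V(s)
params = (list(shared.parameters()) + list(policy_head.parameters())
          + list(value_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

def a2c_update(s, a, r, s2, done):
    """One actor-critic update from a single (s, a, r, s') transition."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    h = shared(s)
    v = value_head(h).squeeze()                                   # V(s)
    with torch.no_grad():
        v_next = torch.tensor(0.0) if done else value_head(shared(s2)).squeeze()
        td_target = r + GAMMA * v_next                            # r + gamma * V(s')

    advantage = (td_target - v).detach()                          # baseline-subtracted signal
    dist = torch.distributions.Categorical(logits=policy_head(h))
    actor_loss = -dist.log_prob(torch.as_tensor(a)) * advantage   # push pi toward good actions
    critic_loss = (td_target - v) ** 2                            # move V(s) toward the TD target

    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```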
Part 3
Current state of reinforcement learning
Current state of reinforcement learning
Mostly in academia or research-focused companies, e.g. DeepMind, OpenAI
Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Barriers to entry:
Driverless cars, robotics, etc. are still largely not using RL.
Promising applications of RL (that aren’t games)
Energy
Finance
Healthcare
Some aspects of robotics
NLP
Computer systems
Traffic light control
Assisting GANs
Neural network architecture
Computer vision
Education
Recommendation systems
Science & Math
References
Clark, Jack. “Faulty Reward Functions in the Wild.” OpenAI, 21 Dec. 2016.
Fridman, Lex (2015). MIT: Introduction to Deep Reinforcement Learning. https://www.youtube.com/watch?v=zR11FLZ-O9M
Fullstack Academy (2017). Monte Carlo Tree Search Tutorial. https://www.youtube.com/watch?v=Fbs4lnGLS8M
Irpan, Alex. “Deep Reinforcement Learning Doesn't Work Yet.” Sorta Insightful, 14 Feb. 2018.
Lapan, M. (2018). Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Birmingham, UK: Packt Publishing.
Silver, David (2015). University College London Reinforcement Learning Course. Lecture 7: Policy Gradient Methods.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press.
Towards Data Science. “Applications of Reinforcement Learning in Real World.” 1 Aug 2018.