Reinforcement Learning Tutorial
Brains, Minds, and Machines Summer Course 2018
TA: Xavier Boix & Yen-Ling Kuo
Introduction
Reinforcement learning (RL) is the problem of making an agent learn to take actions by interacting with its environment.
The RL Set-up
[Diagram: the agent sends action a_t to the world; the world returns observation o_t and reward r_t.]
At each step t the agent: receives observation o_t and reward r_t, and executes action a_t.
At each step t the world: receives action a_t, and emits observation o_{t+1} and reward r_{t+1}.
(Time advances in discrete steps t = 0, 1, 2, …)
The State
Experience: o_0, r_0, a_0, o_1, r_1, a_1, …
State: the current state is a summary of the experience so far:
s_t = f(experience) = f(o_0, r_0, a_0, …, o_t, r_t, a_t)
In a fully observable environment, we assume: s_t = f(o_t)
Example RL Set-up
[Diagram: the agent plays a video game. Observation o_t: a set of images (screen frames). Reward r_t: the game score. Action a_t: joystick commands.]
Example RL Set-up
[Diagram: the same video-game set-up, with the agent implemented as a deep neural network mapping observations to actions.]
RL and Supervised Learning
[Diagram: supervised learning cast in the RL loop. The "world" is a Dog-vs-Cat dataset. Observation o_t: a new image. Action a_t: the predicted label ("It is a dog/cat"). Reward r_t: the classification accuracy on o_{t-1}.]
Supervised learning thus fits the same loop, except that the reward arrives immediately and the agent's actions do not influence future observations.
Policy
Policy: a way of deciding which action to take given the current situation.
The policy is a map from states to actions:
a_t = π(s_t)   (deterministic)
a_t ∼ π(a_t | s_t)   (stochastic)
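To make the distinction concrete, a minimal sketch in Python; the states, actions, and probabilities below are invented for illustration:

```python
import numpy as np

ACTIONS = ["left", "right", "jump"]                 # hypothetical action set

# Deterministic policy a_t = pi(s_t): a fixed lookup from state to action.
DETERMINISTIC_PI = {"s0": "right", "s1": "jump"}

def act_deterministic(state):
    return DETERMINISTIC_PI[state]

# Stochastic policy a_t ~ pi(a_t | s_t): a per-state distribution over actions.
STOCHASTIC_PI = {"s0": [0.1, 0.8, 0.1], "s1": [0.3, 0.3, 0.4]}

def act_stochastic(state, rng=np.random.default_rng(0)):
    return rng.choice(ACTIONS, p=STOCHASTIC_PI[state])
```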
Value Function
The value function answers: "How much reward will I get from state s_t onwards?"
V^π(s_t) = E[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … | s_t ]   (discounted framework, 0 ≤ γ < 1)
Value Function
P(s_{t+1} | s_t, a_t): the transition probability to state s_{t+1} from s_t after action a_t.
Markov assumption: the next state depends only on the current state and action,
P(s_{t+1} | s_0, a_0, …, s_t, a_t) = P(s_{t+1} | s_t, a_t)
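A quick numerical sketch of the discounted return; the rewards and γ below are made up:

```python
# Discounted return: V = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]   # hypothetical rewards r_{t+1}, r_{t+2}, ...

value = sum(gamma**k * r for k, r in enumerate(rewards))
print(value)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```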
Q-function
The Q-function additionally conditions on the first action:
Q^π(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … | s_t, a_t ]
Goal
Find a policy that maximizes the expected cumulative reward.
Types of RL Models
Variants of RL
[Diagram: the agent–world loop, with the agent's internal world model, action, and reward highlighted.]
Policy-based RL: directly optimize the policy to obtain a good return.
Value-based RL: estimate the expected returns and derive the policy from them.
Model-based RL: build a model of the environment and choose actions by planning with the model.
Value-based RL
[Figure: gridworld example. Cells carry rewards +10, +3, and -7; arrows show action probabilities (e.g. 0.2, 0.35, 0.1, 0.35); policy iteration propagates value estimates (e.g. 6.6, 3.1, -1.2, 3.2, 3.41) through the grid.]
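To illustrate the value-based idea in code, here is a minimal value-iteration sketch, a close cousin of the policy iteration named on the slide, on a toy MDP invented for the example (not the grid from the figure):

```python
# Toy deterministic MDP: state -> {action: (next_state, reward)}
MDP = {
    "A": {"right": ("B", 0.0), "down": ("C", 0.0)},
    "B": {"right": ("goal", 10.0), "down": ("C", 0.0)},
    "C": {"right": ("goal", 3.0)},
    "goal": {},                                   # terminal state
}
gamma = 0.9

# Value iteration: repeatedly apply the Bellman backup to every state.
V = {s: 0.0 for s in MDP}
for _ in range(50):
    V = {s: max((r + gamma * V[s2] for _, (s2, r) in MDP[s].items()),
                default=0.0)
         for s in MDP}

# Greedy policy with respect to the converged values.
policy = {s: max(MDP[s], key=lambda a: MDP[s][a][1] + gamma * V[MDP[s][a][0]])
          for s in MDP if MDP[s]}
print(V, policy)   # e.g. V["B"] = 10.0 and the policy heads straight to the goal
```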
Policy-based RL
Policy gradient: ∇_θ J(θ) = E_{τ∼π_θ}[ R(τ) Σ_t ∇_θ log π_θ(a_t | s_t) ].
Practically, the expectation is estimated from only N trajectories sampled from π_θ: this is the REINFORCE algorithm (Williams, 1992).
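A compact REINFORCE sketch on a two-armed bandit invented for the example (a real agent would instead roll out full trajectories in an environment):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # logits of a softmax policy over two actions
TRUE_MEANS = np.array([0.2, 1.0])    # hypothetical expected reward per action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

alpha, N = 0.1, 20                   # learning rate, samples per gradient estimate
for _ in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):               # Monte Carlo estimate with N samples
        p = softmax(theta)
        a = rng.choice(2, p=p)
        r = TRUE_MEANS[a] + rng.normal(0.0, 0.1)
        # REINFORCE term: grad log pi(a) * return; for softmax logits
        # grad log pi(a) = one_hot(a) - p.
        grad += (np.eye(2)[a] - p) * r
    theta += alpha * grad / N        # gradient ascent on the expected return
print(softmax(theta))                # probability mass shifts to the better arm
```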
Model-based RL
Exploration vs Exploitation
[Figure: two games compared, one where exploration is hard and one where it is easy.]
Hard: the agent needs to discover the meaning of the sprites, and may need to sacrifice short-term returns to explore.
Easy: immediate rewards already indicate the right actions, so little exploration is needed.
Sample Efficiency
[Spectrum: from more sample-efficient (model-based RL) to less sample-efficient (policy-gradient methods).]
Other Algorithms
Deep Q-Network
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Recap: Q-Function
Q^π(s, a) = E[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … | s_t = s, a_t = a ]
Optimal Value Functions
An optimal value function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a).
The agent can act optimally with the optimal value function: π*(s) = argmax_a Q*(s, a).
The Bellman Equation
The optimal Q-function decomposes recursively into the immediate reward plus the discounted value of the best next action:
Q*(s, a) = E[ r + γ max_{a′} Q*(s′, a′) | s, a ]
Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)
Deep Q-Network
Represent the value function by a deep neural network with weights w: Q(s, a, w) ≈ Q*(s, a)
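A minimal sketch of such a network in PyTorch (the framework choice is an assumption; the layer sizes and the 4-frame, 84×84 input follow the architecture reported in the DQN paper):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # Q(s, a, w) for every action a
        )

    def forward(self, frames):                  # frames: (batch, 4, 84, 84)
        return self.net(frames)

q = QNetwork(n_actions=6)
print(q(torch.zeros(1, 4, 84, 84)).shape)       # torch.Size([1, 6])
```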
Deep Q-Network
[Figure: DQN overview (courtesy of Tomotake Sasaki, Fujitsu).]
Deep Q-Network
[Diagram: the network's input is the video-game screen (frames); its output is one Q-value per possible action.]
Learning
The target is the received reward plus the discounted value of the best action in the next state. The learning goal is to minimize:
L(w) = E[ ( r + γ max_{a′} Q(s′, a′, w) − Q(s, a, w) )^2 ]
The difference r + γ max_{a′} Q(s′, a′, w) − Q(s, a, w) is known as the temporal-difference (TD) error.
C. Watkins (1989) proved that this rule leads to convergence to Q* (in the tabular setting).
Learning
Goal: minimize L(w) = E[ ( r + γ max_{a′} Q(s′, a′, w⁻) − Q(s, a, w) )^2 ], where w⁻ are the weights of a separate target network.
Update the target network weights w⁻ after backpropagating a batch of sequences, rather than after every sequence (more stable).
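A sketch of this loss for a batch of transitions, reusing the hypothetical QNetwork above (PyTorch, as before an assumption):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s2, done) transitions."""
    s, a, r, s2, done = batch                   # tensors; done is 1.0 at episode end
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a, w)
    with torch.no_grad():                                   # targets use frozen w-
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    return F.mse_loss(q_sa, target)
```

Syncing the target network is then a periodic copy, e.g. target_net.load_state_dict(q_net.state_dict()) every K gradient steps, where K is a tuning choice.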
Learning
To obtain a varied batch of sequences for updating the network:
- Epsilon-greedy: with probability ε take a random action; otherwise take argmax_a Q(s, a, w).
- Experience replay: store past transitions in a memory and sample mini-batches from it uniformly at random.
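Minimal sketches of both tricks (a PyTorch-flavored assumption; the buffer capacity and ε are illustrative):

```python
import random
from collections import deque

import torch

def epsilon_greedy(q_net, state, n_actions, epsilon=0.1):
    """With probability epsilon explore randomly; otherwise exploit Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))

class ReplayBuffer:
    """Stores transitions; uniform sampling breaks temporal correlations."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.memory.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
```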
Demo: DQN for Pong!
Discussions
Successful Cases
Video credit: https://blog.openai.com/learning-dexterity/
Challenges
Thanks & Questions?