Deep Q-Learning for Hard Exploration Problems
Joan Palacios
Supervisor: Miquel Junyent
Index
01 Introduction: A brief introduction to the main topics of the project.
02 Environment & Implementation: A definition of the environment and the different implementations of DQL.
03 Results: Hypothesis, experiments and analysis of the results.
04 Conclusions & future work: A summary of the project and its continuation.
Introduction
Motivations
1. Artificial intelligence: How is it possible to create artificial intelligence?
2. Machine learning: Why is everyone talking about machine learning?
3. Deep learning: What is a neural network?
4. Reinforcement learning: How can I teach an agent to be smart?
Reinforcement learning
Machine learning branches:
Unsupervised learning: clustering/labeling based on similarity.
Supervised learning: continuous value prediction and class/label prediction.
Reinforcement learning: the agent learns to interact with the environment.
Background
Q-Learning: Sequential Decision Process, characteristics of SDPs, Markov Decision Process, policy, value functions, rewards.
Deep Q-Learning: artificial neuron, multilayer perceptron, Convolutional Neural Network, activation function.
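As a reference for the Q-Learning items above, the standard tabular update rule (the textbook formula, not taken from the slide):

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \]

where \(\alpha\) is the learning rate and \(\gamma\) the discount factor. Deep Q-Learning replaces the table with a neural network that estimates \(Q(s, a)\).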
Deep Q-Learning
Playing Atari with Deep Reinforcement Learning (2013)
Human-level control through deep reinforcement learning (2015)
Breakout, an arcade game developed and published by Atari and used to test DQL.
Sparse reward problem
DQL shows good performance on some Atari games but bad performance on others, in particular:
Sparse reward environments
Environments where it is easy to die
Montezuma's Revenge, an arcade game developed and published by Atari and used to test DQL.
Project goals
Planning
1. Q-learning: implement Q-learning.
2. Simple DQL: implement DQL with a simpler state representation.
3. DQL: implement DQL.
4. DQL-B-C: implement DQL with the improvements (Boltzmann and Count-Based exploration).
Environment
Frozen Lake
S | F | F | F
F | H | F | H
F | F | F | H
H | F | F | G
An environment provided by Gym, a toolkit for developing and comparing reinforcement learning algorithms developed by OpenAI. Tiles: S = start, F = frozen surface, H = hole, G = goal.
A 4x4 grid.
A random map generator
Generates a random valid map (one that has a path from start to goal).
Parameters: size, the length of each side of the grid; p, the probability that a tile is frozen.
A 5x5 grid generated randomly.
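A minimal sketch of such a generator, assuming the usual approach of resampling the grid until a start-to-goal path exists (Gym ships a similar helper in gym.envs.toy_text.frozen_lake; the code below is illustrative, not the project's exact implementation):

import numpy as np

def generate_random_map(size=4, p=0.8):
    """Sample random maps until one has a path from start (S) to goal (G)."""
    def has_path(board):
        # depth-first search over non-hole tiles starting from (0, 0)
        stack, seen = [(0, 0)], set()
        while stack:
            r, c = stack.pop()
            if (r, c) in seen:
                continue
            seen.add((r, c))
            if board[r][c] == "G":
                return True
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < size and 0 <= nc < size and board[nr][nc] != "H":
                    stack.append((nr, nc))
        return False

    while True:
        # each tile is frozen (F) with probability p, a hole (H) otherwise
        board = np.random.choice(["F", "H"], (size, size), p=[p, 1 - p])
        board[0][0], board[-1][-1] = "S", "G"
        if has_path(board):
            return ["".join(row) for row in board]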
Matrix as state
In Frozen Lake, the state is represented by a single number (the agent's position inside the grid).
We have built a function that transforms this number into a matrix.
Used in the DQL with a simpler state implementation.
Representation for state 0.
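A minimal sketch of this transform. The integer codes (0 = goal, 1 = hole, 2 = agent, 3 = frozen) are inferred from the example matrix for state 0 shown in the architecture diagram later on and should be taken as an assumption:

import numpy as np

def state_to_matrix(state, desc):
    """state: agent position in [0, n*n); desc: map rows, e.g. ["SFFF", "FHFH", ...]."""
    codes = {"G": 0.0, "H": 1.0, "S": 3.0, "F": 3.0}   # the start tile counts as frozen
    grid = np.array([[codes[c] for c in row] for row in desc], dtype=np.float32)
    n = len(desc)
    grid[state // n, state % n] = 2.0                  # mark the agent's cell
    return grid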
Image as state
We have implemented a function that receives the matrix representation of the state and transforms it into a 64x64 grayscale image.
Used in the DQL implementation.
Final representation of the state: a 64x64 grayscale image (labelled regions: goal, hole, agent, frozen surface).
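A minimal sketch of the matrix-to-image step, assuming each cell code is mapped to a gray level and the grid is scaled up by pixel repetition (the actual gray values are not given on the slide):

import numpy as np

def matrix_to_image(matrix, size=64):
    """Turn an n x n state matrix into a size x size grayscale image (uint8)."""
    n = matrix.shape[0]
    img = np.zeros((n, n), dtype=np.uint8)
    # illustrative gray levels for goal, hole, agent and frozen surface
    for code, level in {0: 255, 1: 0, 2: 170, 3: 85}.items():
        img[matrix == code] = level
    # upscale by pixel repetition; assumes size is a multiple of n
    return np.kron(img, np.ones((size // n, size // n), dtype=np.uint8))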
Plotting Q values
We have developed a grid plot that shows the Q value for each possible action at each cell, on a colour scale from -10 to +10.
A 4x4 grid showing the Q values for each cell.
A 4x4 grid represented as a 64x64 grayscale image.
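A minimal plotting sketch, assuming Q is an (n*n, 4) array with Frozen Lake's action order (left, down, right, up) and reusing the slide's ±10 colour scale:

import numpy as np
import matplotlib.pyplot as plt

def plot_q_grid(Q, n=4):
    """Colour each cell by its greedy Q value and print the per-action values."""
    _, ax = plt.subplots()
    ax.imshow(Q.max(axis=1).reshape(n, n), cmap="coolwarm", vmin=-10, vmax=10)
    offsets = {0: (-0.3, 0.0), 1: (0.0, 0.3), 2: (0.3, 0.0), 3: (0.0, -0.3)}  # left, down, right, up
    for s in range(n * n):
        r, c = divmod(s, n)
        for a, (dx, dy) in offsets.items():
            ax.text(c + dx, r + dy, f"{Q[s, a]:.1f}", ha="center", va="center", fontsize=6)
    plt.show()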
Implementation
The algorithm
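A condensed sketch of the standard deep Q-learning loop (following Mnih et al., 2015): epsilon-greedy acting, an experience replay buffer, and TD targets from a periodically synchronised target network. The hyperparameters, the encode function and the classic Gym reset/step API are assumptions, not the project's exact settings.

import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

def train_dqn(env, q_net, encode, episodes=500, gamma=0.99, eps=0.1,
              batch_size=32, sync_every=100, lr=1e-3):
    """env: classic Gym API; q_net: maps encoded states to Q values;
    encode: turns a raw state into a torch tensor (matrix or image)."""
    target_net = copy.deepcopy(q_net)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay, steps = deque(maxlen=10_000), 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    action = q_net(encode(state).unsqueeze(0)).argmax(1).item()

            next_state, reward, done, _ = env.step(action)
            replay.append((state, action, reward, next_state, done))
            state = next_state
            steps += 1

            if len(replay) >= batch_size:
                # sample a minibatch and regress Q towards the TD targets
                s, a, r, s2, d = zip(*random.sample(replay, batch_size))
                s = torch.stack([encode(x) for x in s])
                s2 = torch.stack([encode(x) for x in s2])
                a, r = torch.tensor(a), torch.tensor(r, dtype=torch.float32)
                d = torch.tensor(d, dtype=torch.float32)

                q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = r + gamma * (1 - d) * target_net(s2).max(1).values
                loss = F.mse_loss(q, target)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # periodically copy the online network into the target network
            if steps % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())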
Deep Q-Learning with a simpler state
Architecture diagram: the example state matrix
2 3 3 3
3 1 3 1
3 3 3 1
1 3 3 0
is flattened and passed through fully connected layers (sizes 16, 32, 16 and 4 in the diagram) with ReLU activations; the 4 outputs are the Q values of the actions.
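A possible PyTorch reading of this diagram, assuming 16 flattened inputs (a 4x4 matrix), hidden layers of 32 and 16 units with ReLU, and 4 outputs:

import torch.nn as nn

class SimpleDQN(nn.Module):
    """Q-network for the matrix state: flatten -> 32 -> 16 -> 4 Q values."""
    def __init__(self, grid_cells=16, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(grid_cells, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, n_actions),
        )

    def forward(self, x):
        return self.net(x)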
Deep Q-Learning
Architecture diagram for the network over the 64x64 grayscale state (ReLU activations).
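A hedged sketch of a convolutional Q-network over the 64x64 grayscale state; the filter counts, kernel sizes and fully connected width are assumptions, since the slide only indicates the ReLU activations:

import torch.nn as nn

class ConvDQN(nn.Module):
    """Q-network for the image state: two conv blocks, then a small MLP head."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4),   # 1x64x64 -> 16x15x15
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x15x15 -> 32x6x6
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6 * 6, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one Q value per action
        )

    def forward(self, x):  # x: (batch, 1, 64, 64)
        return self.head(self.features(x))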
Deep Q-Learning with Boltzmann Count-Based exploration
1. We use a Boltzmann exploration technique instead of an epsilon-greedy strategy.
2. We introduce an intrinsic exploration bonus r+ using Count-Based exploration.
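A minimal sketch of the two ingredients; the temperature tau and the bonus weight beta are assumed hyperparameters, and the bonus uses the common 1/sqrt(N(s)) form of count-based exploration:

import numpy as np

def boltzmann_action(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / tau); q_values: 1-D array."""
    prefs = np.exp((q_values - q_values.max()) / tau)   # max-shift for numerical stability
    probs = prefs / prefs.sum()
    return np.random.choice(len(q_values), p=probs)

def count_based_bonus(counts, state, beta=0.1):
    """Intrinsic bonus r+ = beta / sqrt(N(s)), added to the extrinsic reward."""
    counts[state] = counts.get(state, 0) + 1
    return beta / np.sqrt(counts[state])

In training, boltzmann_action would replace the epsilon-greedy choice, and count_based_bonus(counts, next_state) would be added to the environment reward before the transition is stored.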
DQL-B-C
DEMO!
Results
Hypothesis & Experiments
Hypothesis:
DQL-B-C works better than DQL in sparse reward environments.
Experiments:
Test DQL-B-C vs. DQL in 3 different environments of increasing difficulty. In addition, we have compared DQL-B and DQL-C.
DQL-B-C, DQL-B and DQL-C vs. DQL
Experiment specifications
The hyperparameters were chosen informally, fixing all but one and changing them one at a time.
Each algorithm runs 4 times on each map.
There are 3 maps: 4x4, 6x6, 8x8.
The first map (4x4)
The second map (6x6)
The third map (8x8)
4x4
The first map (4x4)
6x6
The second map (6x6)
8x8
The third map (8x8)
Conclusions
Boltzmann exploration works better than epsilon-greedy.
The most significant improvement came from the implementation of Count-Based exploration.
DQL-B-C was the algorithm with the best results on most of the three maps.
Boltzmann / epsilon-greedy
Boltzmann exploration works better than epsilon-greedy
Count-Based exploration
The most significant improvement came from the implementation of Count-Based exploration.
DQL-B-C
DQL-B-C was the algorithm with the best results on most of the three maps.
Future work
Environments: implementing this algorithm in other environments such as Montezuma's Revenge.
Counting: counting more complex states or using pseudo-counts.
Robotics: a more practical orientation towards the problem.
Curiosity?: Curiosity-driven Exploration by Self-supervised Prediction.
Possible improvements
Hyperparameter table
Increase stability of the neural network
Project goals
*In Frozen Lake
THANKS
Does anyone have any questions?
References
Oh, J. et al. (2015). 'Action-conditional video prediction using deep networks in Atari games'. In: Advances in Neural Information Processing Systems, pp. 2863–2871.
Paszke, A. et al. (2019). 'PyTorch: An Imperative Style, High-Performance Deep Learning Library'. In: Advances in Neural Information Processing Systems 32. Ed. by Wallach, H. et al. Curran Associates, Inc., pp. 8024–8035.
Riedmiller, M. et al. (2018). 'Learning by playing - solving sparse reward tasks from scratch'. In: arXiv preprint arXiv:1802.10567.
Ruder, S. (2016). An overview of gradient descent optimization algorithms. url: https://ml-cheatsheet.readthedocs.io/ (visited on 20/08/2020).
Simonyan, K. and Zisserman, A. (2014). 'Very deep convolutional networks for large-scale image recognition'. In: arXiv preprint arXiv:1409.1556.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.