1 of 43


Deep Q-Learning for Hard Exploration Problems

Joan Palacios

Supervisor: Miquel Junyent

2 of 43

Index

01 Introduction: a brief introduction to the main topics of the project.

02 Environment & Implementation: a definition of the environment and the different implementations of DQL.

03 Results: hypothesis, experiments and analysis of the results.

04 Conclusions & future work: a summary of the project and its continuation.

3 of 43


Introduction

4 of 43

Motivations

1. Artificial intelligence: how is it possible to create artificial intelligence?

2. Machine learning: why is everyone talking about machine learning?

3. Deep learning: what is a neural network?

4. Reinforcement learning: how can I teach an agent to be smart?

5 of 43

Reinforcement learning

Machine learning:

• Unsupervised learning: clustering/labeling based on similarity

• Supervised learning: continuous value prediction; class/label prediction

• Reinforcement learning: an agent learns to interact with an environment

6 of 43

Background

• Sequential Decision Process: characteristics of SDPs, Markov Decision Process, policy, value functions, rewards

• Q-Learning

• Deep Q-Learning: artificial neuron, multilayer perceptron, Convolutional Neural Network, activation function

7 of 43

Deep Q-Learning

• Playing Atari with Deep Reinforcement Learning (2013)

• Human-level control through deep reinforcement learning (2015)

Breakout, an arcade game developed and published by Atari and used to test DQL.

8 of 43

Figure: Atari games grouped by DQL results (good performance vs. bad performance).

9 of 43

Sparse reward problem

• Sparse reward environments

• Environments where it is easy to die

Montezuma's Revenge, an arcade game developed and published by Atari and used to test DQL.

10 of 43

Project goals

  1. Implement Deep Q-Learning

  2. Improve Deep Q-Learning to get better results in sparse reward environments

11 of 43

Planning

1. Q-learning: implement Q-learning

2. Simple DQL: implement DQL with a simpler state representation

3. DQL: implement DQL

4. DQL-B-C: implement DQL with the improvements (Boltzmann and Count-Based exploration)

12 of 43


Environment

13 of 43

Frozen Lake

S F F F
F H F H
F F F H
H F F G

An environment provided by Gym, a toolkit for developing and comparing reinforcement learning algorithms developed by OpenAI.

  • S: starting point, safe
  • F: frozen surface, safe
  • H: hole, fall to your doom
  • G: goal, where the agent has to go

A 4x4 grid.
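As a rough illustration, the environment can be created and stepped through with Gym's standard API (a minimal sketch; the exact environment id and Gym version used in the project may differ):

```python
import gym

# The classic 4x4 Frozen Lake map shown above, without the slippery dynamics.
env = gym.make("FrozenLake-v0", is_slippery=False)

state = env.reset()            # the state is a single integer: the agent's cell index
done = False
while not done:
    action = env.action_space.sample()             # 0: left, 1: down, 2: right, 3: up
    state, reward, done, info = env.step(action)   # reward is 1 only when G is reached
env.render()
```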

14 of 43


A random map generator

Generates a random valid map (one that has a path from start to goal)

• size: size of each side of the grid

• p: probability that a tile is frozen

A 5x5 grid generated randomly.
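A minimal sketch of such a generator (the function name and the BFS validity check are assumptions; Gym ships a similar helper): tiles are sampled as frozen with probability p, the corners are set to S and G, and maps are resampled until a path from start to goal exists.

```python
import random
from collections import deque

def generate_random_map(size=4, p=0.8):
    """Sample maps until one has a path over frozen tiles from S to G."""
    while True:
        grid = [["F" if random.random() < p else "H" for _ in range(size)]
                for _ in range(size)]
        grid[0][0], grid[-1][-1] = "S", "G"
        if _has_path(grid, size):
            return ["".join(row) for row in grid]

def _has_path(grid, size):
    """Breadth-first search over non-hole tiles starting from the top-left corner."""
    frontier, seen = deque([(0, 0)]), {(0, 0)}
    while frontier:
        r, c = frontier.popleft()
        if grid[r][c] == "G":
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < size and 0 <= nc < size and (nr, nc) not in seen \
                    and grid[nr][nc] != "H":
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False
```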

15 of 43


Matrix as state

In Frozen Lake, the state representation is just a number (the agent's position inside the grid).

We have built a function that transforms the state, a number, into a matrix.

Used in the DQL with a simpler state implementation.

Representation for state 0.

  • 0: goal
  • 1: hole
  • 2: agent
  • 3: frozen surface
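A minimal sketch of such a conversion (the function name and the map argument are assumptions; the cell codes follow the legend above):

```python
import numpy as np

# Cell encoding from the legend: 0 = goal, 1 = hole, 2 = agent, 3 = frozen surface.
TILE_CODES = {"G": 0, "H": 1, "S": 3, "F": 3}

def state_to_matrix(state, desc):
    """Turn the scalar Frozen Lake state (the agent's cell index) into a matrix.

    `desc` is the map description, e.g. ["SFFF", "FHFH", "FFFH", "HFFG"].
    """
    size = len(desc)
    matrix = np.array([[TILE_CODES[tile] for tile in row] for row in desc],
                      dtype=np.float32)
    matrix[state // size, state % size] = 2  # mark the agent's position
    return matrix
```

For state 0 on the 4x4 map this marks the top-left cell with a 2, which is the representation shown above.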

16 of 43


Image as state

We have implemented a function that receives a matrix representation of the state and transforms it into a 64x64 grayscale image.

Used in the DQL implementation.

Final representation of the state: a 64x64 grayscale image whose gray levels encode the goal, the holes, the agent and the frozen surface.
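A minimal sketch of such a conversion, assuming each cell code is mapped to a gray level and the matrix is upscaled by repeating pixels (the exact gray levels are an assumption):

```python
import numpy as np

def matrix_to_image(matrix, out_size=64):
    """Upscale the small state matrix into an out_size x out_size grayscale image."""
    # Map cell codes (0..3) to gray levels in [0, 255]; the mapping is arbitrary here.
    gray = (matrix / 3.0 * 255).astype(np.uint8)
    # Repeat each cell into a block of pixels (assumes the grid size divides out_size).
    scale = out_size // matrix.shape[0]          # e.g. 64 // 4 = 16 pixels per cell
    return np.kron(gray, np.ones((scale, scale), dtype=np.uint8))
```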

17 of 43


Plotting Q values

We have developed a grid plot which shows the Q values for each possible action at each cell.

Figure: a 4x4 grid showing the Q values for each cell (color scale from -10 to +10), next to the same 4x4 grid represented as a 64x64 grayscale image.
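A minimal matplotlib sketch of such a plot, assuming a Q table of shape (number of cells, 4) with the Frozen Lake action order (left, down, right, up); the layout is a simplification of the actual figure:

```python
import matplotlib.pyplot as plt

def plot_q_values(q_values, size=4):
    """Write the four action values inside each cell of a size x size grid."""
    fig, ax = plt.subplots(figsize=(size, size))
    ax.set_xlim(0, size)
    ax.set_ylim(0, size)
    ax.invert_yaxis()                              # row 0 at the top, like the map
    ax.set_xticks(range(size + 1))
    ax.set_yticks(range(size + 1))
    ax.grid(True)
    for cell in range(size * size):
        row, col = divmod(cell, size)
        left, down, right, up = q_values[cell]
        ax.text(col + 0.05, row + 0.5, f"{left:.1f}", fontsize=7, va="center")
        ax.text(col + 0.95, row + 0.5, f"{right:.1f}", fontsize=7, va="center", ha="right")
        ax.text(col + 0.5, row + 0.15, f"{up:.1f}", fontsize=7, ha="center")
        ax.text(col + 0.5, row + 0.9, f"{down:.1f}", fontsize=7, ha="center")
    plt.show()
```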

18 of 43


Implementation

19 of 43


The algorithm
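A condensed sketch of the standard Deep Q-Learning loop this slide refers to (experience replay plus a periodically synchronised target network, as in the DQN papers); names such as `to_input` and all hyperparameter values are placeholders:

```python
import random
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, target_net, optimizer, to_input,
              episodes=1000, gamma=0.99, batch_size=32,
              buffer_size=10_000, epsilon=0.1, sync_every=100):
    """Sketch of the DQN loop: act, store transitions, replay, sync the target net."""
    buffer, step = [], 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy exploration (later replaced by Boltzmann exploration).
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    action = q_net(to_input(state).unsqueeze(0)).argmax(1).item()
            next_state, reward, done, _ = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            buffer = buffer[-buffer_size:]
            state, step = next_state, step + 1

            if len(buffer) >= batch_size:
                sample = random.sample(buffer, batch_size)
                s, a, r, s2, d = zip(*sample)
                s = torch.stack([to_input(x) for x in s])
                s2 = torch.stack([to_input(x) for x in s2])
                a = torch.tensor(a).unsqueeze(1)
                r = torch.tensor(r, dtype=torch.float32)
                d = torch.tensor(d, dtype=torch.float32)
                q = q_net(s).gather(1, a).squeeze(1)
                with torch.no_grad():
                    q_next = target_net(s2).max(1).values
                loss = F.mse_loss(q, r + gamma * (1 - d) * q_next)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())
```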

20 of 43


Deep Q-Learning with a simpler state

Figure: the network for the matrix state. The 4x4 state matrix, e.g.

2 3 3 3
3 1 3 1
3 3 3 1
1 3 3 0

for state 0, is flattened into 16 inputs and passed through two fully connected hidden layers of 32 and 16 units with ReLU activations, producing 4 outputs (one Q value per action).
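A minimal PyTorch sketch of this network (layer sizes taken from the diagram; the class name is an assumption):

```python
import torch
import torch.nn as nn

class SimpleStateDQN(nn.Module):
    """MLP Q-network for the matrix state: 16 inputs -> 32 -> 16 -> 4 Q values."""

    def __init__(self, grid_size=4, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # 4x4 matrix -> 16 values
            nn.Linear(grid_size * grid_size, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, n_actions),            # one Q value per action
        )

    def forward(self, x):
        return self.net(x)

# Example: a batch containing a single 4x4 state matrix -> Q values of shape (1, 4).
q_values = SimpleStateDQN()(torch.zeros(1, 4, 4))
```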

21 of 43


Deep Q-Learning

Figure: the Q-network architecture for the 64x64 grayscale image state (activation function: ReLU).
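The exact layer sizes are not preserved on this slide, so the following is only an assumed sketch of a small convolutional Q-network for the 64x64 grayscale input:

```python
import torch
import torch.nn as nn

class ImageStateDQN(nn.Module):
    """Assumed convolutional Q-network: 64x64 grayscale image in, 4 Q values out."""

    def __init__(self, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),   # 64x64 -> 15x15
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 15x15 -> 6x6
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6 * 6, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):            # x: (batch, 1, 64, 64), pixels scaled to [0, 1]
        return self.head(self.features(x))

q_values = ImageStateDQN()(torch.zeros(1, 1, 64, 64))   # shape (1, 4)
```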

22 of 43


Deep Q-Learning with Boltzmann Count-Based exploration

1. We use a Boltzmann exploration technique instead of an epsilon-greedy strategy.

2. We introduce an intrinsic exploration bonus r+ using Count-Based exploration.
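A minimal sketch of the two ideas; the temperature, the bonus scale beta and the bonus form beta / sqrt(N(s)) are common choices and assumptions here, not necessarily the exact ones used in the project:

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / temperature)."""
    prefs = np.asarray(q_values, dtype=np.float64) / temperature
    prefs -= prefs.max()                       # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(probs), p=probs)

def count_based_bonus(state, visit_counts, beta=0.1):
    """Intrinsic bonus r+ that shrinks as a state is visited more often."""
    visit_counts[state] = visit_counts.get(state, 0) + 1
    return beta / np.sqrt(visit_counts[state])

# The Q-network is then trained on the augmented reward r + r+.
```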

23 of 43


DQL-B-C

24 of 43


DEMO!

  • A 6x6 map

  • A more "labyrinth"-like structure, so there is no direct/diagonal route from the initial to the final state

  • Instability of DQL-B-C

25 of 43


Results

26 of 43


Hypothesis & Experiments

Hypothesis:

DQL-B-C works better than DQL in sparse reward environments.

Experiments:

Test DQL-B-C vs. DQL in 3 increasingly difficult environments. In addition, we have compared DQL-B and DQL-C.

Compared algorithms: DQL-B-C, DQL-B and DQL-C vs. DQL.

27 of 43


Experiment specifications

  • Hyperparameters

We have chosen them informally, fixing all the others and changing one at a time.

  • Runs

Each algorithm runs 4 times for each map.

  • Maps

There are 3 maps: 4x4, 6x6, 8x8.

The first map (4x4)

The second map (6x6)

The third map (8x8)

28 of 43


4x4

The first map (4x4)

29 of 43


6x6

The second map (6x6)

30 of 43


8x8

The third map (8x8)

31 of 43


Conclusions

32 of 43


Conclusions

• Boltzmann exploration works better than epsilon-greedy

• The most significant improvement came from the implementation of Count-Based exploration

• DQL-B-C was the algorithm with the best results in most of the three maps

33 of 43


Boltzmann / epsilon-greedy

Boltzmann exploration works better than epsilon-greedy

34 of 43


Count-Based exploration

The most significant improvement came from the implementation of Count-Based exploration

35 of 43


DQL-B-C

DQL-B-C was the algorithm with the best results in most of the three maps

36 of 43


Future work

37 of 43


Future work

1. Environments: implementing this algorithm in other environments such as Montezuma's Revenge

2. Counting: counting more complex states / pseudo-counts

3. Robotics: a more practical orientation of the problem

4. Curiosity: Curiosity-driven Exploration by Self-supervised Prediction

38 of 43


Possible improvements

39 of 43


Hyperparameter table

40 of 43


Increase stability of the neural network

  • L2 regularization (done)

  • Limit the norm of the gradient, to avoid exploding gradients.

  • Use statistics (mean, std) in evaluation mode with batch normalization.
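A minimal PyTorch sketch of these three stabilizers (the network and all hyperparameter values are placeholders):

```python
import torch
import torch.nn as nn

# A stand-in Q-network with a batch normalization layer (placeholder architecture).
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 15 * 15, 4),
)

# 1. L2 regularization via the optimizer's weight decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # 2. Clip the gradient norm to avoid exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()

# 3. Use the running statistics (mean, std) of batch norm when evaluating the policy.
model.eval()
with torch.no_grad():
    q_values = model(torch.zeros(1, 1, 64, 64))   # shape (1, 4)
model.train()
```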

41 of 43

Project goals

  • Implement Deep Q-Learning ✔️

  • Improve Deep Q-Learning to get better results in sparse reward environments* ✔️

*In Frozen Lake

42 of 43


THANKS

Does anyone have any questions?

43 of 43

References

Oh, Junhyuk et al. (2015). ‘Action-conditional video prediction using deep networks in Atari games’. In: Advances in Neural Information Processing Systems, pp. 2863–2871.

Paszke, A. et al. (2019). ‘PyTorch: An Imperative Style, High-Performance Deep Learning Library’. In: Advances in Neural Information Processing Systems 32. Ed. by Wallach, H. et al. Curran Associates, Inc., pp. 8024–8035.

Riedmiller, M. et al. (2018). ‘Learning by playing - solving sparse reward tasks from scratch’. In: arXiv preprint arXiv:1802.10567.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. URL: https://ml-cheatsheet.readthedocs.io/ (visited on 20/08/2020).

Simonyan, K. and Zisserman, A. (2014). ‘Very deep convolutional networks for large-scale image recognition’. In: arXiv preprint arXiv:1409.1556.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. 2nd ed. MIT Press.