1 of 17

Low Dimensional State Representation Learning with Reward-shaped Priors

Nicolò Botteghi*, Ruben Obbink, Daan Geijs, Mannes Poel, Beril Sirmacek, Christoph Brune, Abeje Mersha, Stefano Stramigioli

2 of 17

Introduction

Autonomously navigating and exploring the world is one of the major challenges of mobile robotics.

When the world is not known a priori (e.g. no map is available), many path planning algorithms (e.g. A* or potential fields) cannot be used.

Additionally, we often deal with a continuously changing world, which makes pre-coded navigation solutions brittle.

Need for adaptable and robust navigation strategies

3 of 17

Introduction

Deep Reinforcement Learning can be used to tackle the problem of learning the optimal actions directly from the sensory data.

However…

…with high-dimensional observations, reinforcement learning algorithms are usually not sample-efficient

…there is no control over the relevant features learned and the state representation

4 of 17

Methodology

We propose a framework separating feature learning and policy learning

5 of 17

Methodology

End-to-end framework separating feature learning and policy learning

State Encoder Network

6 of 17

Methodology

End-to-end framework separating feature learning and policy learning

State Encoder Network

Deep Q-Network
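As a rough illustration (not the authors' exact implementation; layer sizes, names, and the use of PyTorch are assumptions), the two components could be sketched as a convolutional state encoder that compresses the camera observation into a low-dimensional state, followed by a small Q-network over the discrete navigation actions:

import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    # Compresses a high-dimensional observation (e.g. an RGB image) into a
    # low-dimensional state vector; architecture and sizes are illustrative.
    def __init__(self, state_dim=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(state_dim)  # infers the flattened input size

    def forward(self, obs):
        return self.fc(self.conv(obs))

class DeepQNetwork(nn.Module):
    # Maps the learned low-dimensional state to Q-values over discrete actions.
    def __init__(self, state_dim=5, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Example: encode a batch of 64x64 RGB observations and compute Q-values.
encoder, dqn = StateEncoder(), DeepQNetwork()
q_values = dqn(encoder(torch.rand(8, 3, 64, 64)))  # shape (8, 4)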

7 of 17

Reward-shaped Robotics Priors

Temporal Coherence: State changes are slow and dependent only on the most recent past.

Reward Proportionality (new prior): States with similar changes have similar reward changes.

Causality (new prior): Dissimilar rewards are symptoms of state dissimilarity.

Reward Repeatability (new prior): Reinforces the similarity of states that present the same reward variation.

Methodology
8 of 17

Reward-shaped Robotics Priors

Temporal Coherence: State changes are slow and dependent only on the most recent past.

Reward Proportionality (new prior): States with similar changes have similar reward changes.

Causality (new prior): Dissimilar rewards are symptoms of state dissimilarity.

Reward Repeatability (new prior): Reinforces the similarity of states that present the same reward variation.

Methodology

Total Loss: weighted sum of the priors with the addition of L2-regularization.
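As a rough sketch of how these priors could be turned into loss terms (an assumption following the style of the original robotic priors of Jonschkowski and Brock, not necessarily the authors' exact formulation), with s_t the encoded state, Δs_t = s_{t+1} − s_t and Δr_t = r_{t+1} − r_t:

L_temporal        ≈ E[ ||Δs_t||² ]
L_proportionality ≈ E[ (||Δs_t2|| − ||Δs_t1||)² ]                       over pairs with similar reward change
L_causality       ≈ E[ exp(−||s_t2 − s_t1||²) ]                         over pairs with dissimilar rewards
L_repeatability   ≈ E[ exp(−||s_t2 − s_t1||²) · ||Δs_t2 − Δs_t1||² ]    over pairs with the same reward variation

L_total = ω1·L_temporal + ω2·L_proportionality + ω3·L_causality + ω4·L_repeatability + λ||θ||²

with ω_i the prior weights and λ||θ||² the L2 regularization on the encoder parameters θ.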

9 of 17

First, we aim to learn an informative state representation in order to efficiently learn robust navigation policies in different virtual environments

Experimental Design

Reward function:

r_t =  r_reached                          if the target is reached
       r_crashed                          if a collision occurs
       −α · d(p_robot, p_target)          otherwise

where r_reached is a positive scalar rewarding reaching the target, r_crashed is a negative scalar penalizing collisions, d(p_robot, p_target) is the Euclidean distance from the current position of the robot to the target location, and α is a scaling factor.
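A minimal Python sketch of such a reward (names and numerical values are illustrative assumptions):

import math

def navigation_reward(robot_pos, target_pos, reached, collided,
                      r_reached=1.0, r_crashed=-1.0, alpha=0.01):
    # Sparse terminal rewards plus a dense distance-based penalty.
    if reached:        # positive scalar for reaching the target
        return r_reached
    if collided:       # negative scalar penalizing a collision
        return r_crashed
    # Otherwise: penalty proportional to the Euclidean distance to the target.
    return -alpha * math.dist(robot_pos, target_pos)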

10 of 17

Second, we aim to learn the state representation and policy in a simulation environment and transfer this knowledge to the real robot

Training environment (simulated)

Testing environment (real)

Experimental Design

11 of 17

Results

Using the proposed priors, we can learn an informative state representation and efficiently learn robust navigation policies in different virtual environments

12 of 17

Training environment (simulated)

Testing environment (real)

Results

We can successfully transfer the state representation and policy learned in the simulation environment to the real robot without any re-training

13 of 17

Finally, we test the robustness of the approach when the real environment appears very different from the simulated one, by switching off the lights in the room.

Lights off: searching behaviour without collisions

Lights on: reaching the target

Results

14 of 17

Thank you for your attention!

15 of 17

The state representation can encode the properties of the task (e.g. distance/orientation with respect to the target). We show the clustering of the state predictions when the reward function is based on the distance to the target (A-B) and when the reward is based on the orientation with respect to the target (C-D).

t-SNE visualization of the state predictions
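For reference, a minimal sketch of how such a projection could be produced with scikit-learn (assuming states is an N×d array of predicted states and labels encodes, e.g., the distance to the target; names and data are illustrative placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

states = np.random.rand(500, 5)   # placeholder for the predicted low-dimensional states
labels = np.random.rand(500)      # placeholder colouring, e.g. distance to the target

embedding = TSNE(n_components=2).fit_transform(states)  # project the states to 2-D
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="viridis")
plt.colorbar(label="distance to target")
plt.show()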

Appendix A

16 of 17

Even when there are multiple target locations (two in this case), the state representation can encode this property.

PCA visualization (left) and t-SNE visualization (right) of the state predictions

Appendix B

17 of 17

The agent, using the state representation and policy learned in the simulation environment, can steer the real robot to the target locations.

On the left: trajectory samples of the real robot when the light is on (blue dots). On the right: trajectory samples when the light is first off (purple dots) and then on (blue dots).

The starting position is indicated by the green rectangle and the target location by the red circle.

Appendix C