Low Dimensional State Representation Learning with Reward-shaped Priors
Nicolò Botteghi*, Ruben Obbink, Daan Geijs, Mannes Poel, Beril Sirmacek, Christoph Brune, Abeje Mersha, Stefano Stramigioli
Introduction
Autonomously navigating and exploring the world is one of the big challenges of mobile robotics.
When the world is not known a priori (e.g. no map is available), many path planning algorithms (e.g. A* or potential fields) cannot be used.
Additionally, we often deal with a continuously changing world that makes pre-coded navigation solutions brittle.
Need for adaptable and robust navigation strategies
Introduction
Deep Reinforcement Learning can be used to tackle the problem of learning the optimal actions directly from the sensory data.
However…
…with high-dimensional observations, reinforcement learning algorithms are usually not sample-efficient
…there is no control over the relevant features learned and the state representation
Methodology
We propose a framework separating feature learning and policy learning
Methodology
End-to-end framework separating feature learning and policy learning
State Encoder Network
Deep Q-Network
Reward-shaped Robotics Priors
Temporal Coherence: State changes are slow and dependent only on the most recent past.
Reward Proportionality (new prior): States with similar changes have similar reward changes.
Causality (new prior): Dissimilar rewards are symptoms of state dissimilarity.
Reward Repeatability (new prior): Reinforces the similarity of states when the same reward variation is presented.
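The four priors above can be expressed as loss terms over a batch of consecutive state predictions and rewards. A minimal NumPy sketch follows; the concrete formulations (pairwise comparisons, the exponential similarity weighting, the `tol` threshold) are plausible assumptions based on the verbal descriptions on this slide, not the authors' exact losses.

```python
import numpy as np

def temporal_coherence(s):
    """Temporal coherence: state changes should be slow (small Delta s)."""
    ds = np.diff(s, axis=0)
    return np.mean(np.sum(ds**2, axis=1))

def reward_proportionality(s, r, tol=1e-6):
    """Transitions with the same reward change should have state
    changes of similar magnitude."""
    ds, dr = np.diff(s, axis=0), np.diff(r)
    loss, n = 0.0, 0
    for i in range(len(ds)):
        for j in range(i + 1, len(ds)):
            if abs(dr[i] - dr[j]) < tol:
                loss += (np.linalg.norm(ds[i]) - np.linalg.norm(ds[j]))**2
                n += 1
    return loss / max(n, 1)

def causality(s, r, tol=1e-6):
    """States that yield dissimilar rewards should lie far apart."""
    loss, n = 0.0, 0
    for i in range(len(s) - 1):
        for j in range(i + 1, len(s) - 1):
            if abs(r[i + 1] - r[j + 1]) > tol:
                loss += np.exp(-np.linalg.norm(s[i] - s[j]))
                n += 1
    return loss / max(n, 1)

def reward_repeatability(s, r, tol=1e-6):
    """Nearby states undergoing the same reward variation should
    also undergo similar state changes."""
    ds, dr = np.diff(s, axis=0), np.diff(r)
    loss, n = 0.0, 0
    for i in range(len(ds)):
        for j in range(i + 1, len(ds)):
            if abs(dr[i] - dr[j]) < tol:
                loss += (np.exp(-np.linalg.norm(s[i] - s[j]))
                         * np.sum((ds[i] - ds[j])**2))
                n += 1
    return loss / max(n, 1)
```

In practice these losses would be computed on mini-batches of encoder outputs and back-propagated through the state encoder network.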
Methodology
Total Loss: weighted sum of the priors with the addition of L2-regularization.
First, we aim at learning an informative state representation to efficiently learn robust navigation policies in different virtual environments
Experimental Design
where r_target is a positive scalar rewarding reaching the target, r_collision is a negative scalar penalizing collisions, d is the Euclidean distance from the current position of the robot to the target location, and λ is a scaling factor
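A sketch of such a reward function, combining sparse terminal rewards with a dense distance penalty. The symbol names and default values (r_target, r_collision, the scaling factor) are hypothetical, chosen only to match the quantities described above.

```python
import numpy as np

def navigation_reward(position, target, reached, collided,
                      r_target=1.0, r_collision=-1.0, scale=0.01):
    """Reward for the navigation task: positive on reaching the
    target, negative on collision, otherwise a scaled negative
    Euclidean distance to the target (assumed shaping term)."""
    if reached:
        return r_target      # positive scalar rewarding the target
    if collided:
        return r_collision   # negative scalar penalizing collisions
    d = np.linalg.norm(np.asarray(position) - np.asarray(target))
    return -scale * float(d)
```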
Second, we aim at learning the state representation and policy in a simulation environment and transferring the knowledge to the real robot
Training environment (simulated)
Testing environment (real)
Experimental Design
Results
Using the proposed priors, we can learn an informative state representation and efficiently learn robust navigation policies in different virtual environments
Training environment (simulated)
Testing environment (real)
Results
We can successfully transfer the state representation and policy learned in the simulation environment to the real robot without any re-training
Finally, we test the robustness of the approach when the real environment appears very different from the simulated one, by switching off the lights in the room.
Light off: searching behaviour without collisions
Light on: reaching the target
Results
Thank you for your attention!
The state representation can encode the properties of the task (e.g. distance/orientation with respect to the target). We show the clustering of the state predictions when the reward function is based on the distance to the target (A-B) and when it is based on the orientation with respect to the target (C-D).
t-SNE visualization of the state predictions
Appendix A
Even when there are multiple target locations (two in this case), the state representation can encode this property.
PCA visualization (left) and t-SNE visualization (right) of the state predictions
Appendix B
The agent, using the state representation and policy learned in the simulation environment, can steer the real robot to the target locations.
On the left: trajectory samples of the real robot when the light is on (blue dots). On the right: trajectory samples when the light is first off (purple dots) and then on (blue dots).
The starting position is indicated by the green rectangle and the target location by the red circle.
Appendix C