Low Dimensional State Representation Learning with Reward-shaped Priors
Nicolò Botteghi*, Ruben Obbink, Daan Geijs, Mannes Poel, Beril Sirmacek, Christoph Brune, Abeje Mersha, Stefano Stramigioli
Introduction
Autonomously navigating and exploring the world is one of the big challenges of mobile robotics.
When the world is not known a priori (e.g. no map is available), many path planning algorithms (e.g. A* or potential fields) cannot be used.
Additionally, we often deal with a continuously changing world that makes pre-coded navigation solutions brittle.
Need for adaptable and robust navigation strategies
Introduction
Deep Reinforcement Learning can be used to tackle the problem of learning the optimal actions directly from the sensory data.
However…
…with high-dimensional observations, reinforcement learning algorithms are usually not sample-efficient
…there is no control over the relevant features learned and the state representation
Methodology
We propose a framework separating feature learning and policy learning
Methodology
End-to-end framework separating feature learning and policy learning
State Encoder Network
Deep Q-Network
Reward-shaped Robotics Priors
Temporal Coherence: State changes are slow and dependent only on the most recent past.
Reward Proportionality (new prior): States with similar changes have similar reward changes.
Causality (new prior): Dissimilar rewards are symptoms of state dissimilarity.
Reward Repeatability (new prior): Reinforces the similarity of states when the same reward variation is presented.
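The four priors above can be expressed as loss terms over a batch of consecutive state predictions and rewards. A minimal NumPy sketch follows; the concrete formulations (pairwise comparisons, the exponential similarity weighting, the `tol` threshold) are plausible assumptions based on the verbal descriptions on this slide, not the authors' exact losses.

```python
import numpy as np

def temporal_coherence(s):
    """Temporal coherence: state changes should be slow (small Delta s)."""
    ds = np.diff(s, axis=0)
    return np.mean(np.sum(ds**2, axis=1))

def reward_proportionality(s, r, tol=1e-6):
    """Transitions with the same reward change should have state
    changes of similar magnitude."""
    ds, dr = np.diff(s, axis=0), np.diff(r)
    loss, n = 0.0, 0
    for i in range(len(ds)):
        for j in range(i + 1, len(ds)):
            if abs(dr[i] - dr[j]) < tol:
                loss += (np.linalg.norm(ds[i]) - np.linalg.norm(ds[j]))**2
                n += 1
    return loss / max(n, 1)

def causality(s, r, tol=1e-6):
    """States that yield dissimilar rewards should lie far apart."""
    loss, n = 0.0, 0
    for i in range(len(s) - 1):
        for j in range(i + 1, len(s) - 1):
            if abs(r[i + 1] - r[j + 1]) > tol:
                loss += np.exp(-np.linalg.norm(s[i] - s[j]))
                n += 1
    return loss / max(n, 1)

def reward_repeatability(s, r, tol=1e-6):
    """Nearby states undergoing the same reward variation should
    also undergo similar state changes."""
    ds, dr = np.diff(s, axis=0), np.diff(r)
    loss, n = 0.0, 0
    for i in range(len(ds)):
        for j in range(i + 1, len(ds)):
            if abs(dr[i] - dr[j]) < tol:
                loss += (np.exp(-np.linalg.norm(s[i] - s[j]))
                         * np.sum((ds[i] - ds[j])**2))
                n += 1
    return loss / max(n, 1)
```

In practice these losses would be computed on mini-batches of encoder outputs and back-propagated through the state encoder network.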
Methodology
Total Loss: weighted sum of the priors with the addition of L2-regularization.
First, we aim at learning an informative state representation to efficiently learn robust navigation policies in different virtual environments
Experimental Design
where r_target is a positive scalar rewarding reaching the target, r_collision is a negative scalar penalizing collisions, d is the Euclidean distance from the current position of the robot to the target location, and λ is a scaling factor
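A sketch of such a reward function, combining sparse terminal rewards with a dense distance penalty. The symbol names and default values (r_target, r_collision, the scaling factor) are hypothetical, chosen only to match the quantities described above.

```python
import numpy as np

def navigation_reward(position, target, reached, collided,
                      r_target=1.0, r_collision=-1.0, scale=0.01):
    """Reward for the navigation task: positive on reaching the
    target, negative on collision, otherwise a scaled negative
    Euclidean distance to the target (assumed shaping term)."""
    if reached:
        return r_target      # positive scalar rewarding the target
    if collided:
        return r_collision   # negative scalar penalizing collisions
    d = np.linalg.norm(np.asarray(position) - np.asarray(target))
    return -scale * float(d)
```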
Second, we aim at learning the state representation and policy in a simulation environment and transferring the knowledge to the real robot
Training environment (simulated)
Testing environment (real)
Experimental Design
Results
Using the proposed priors, we can learn an informative state representation and efficiently learn robust navigation policies in different virtual environments
Training environment (simulated)
Testing environment (real)
Results
We can successfully transfer the state representation and policy learned in the simulation environment to the real robot without any re-training
Finally, we test the robustness of the approach when the real environment appears very different from the simulated one, by switching off the lights in the room.
Light off: searching behaviour without collisions
Light on: reaching the target
Results
Thank you for your attention!
The state representation can encode the properties of the task (e.g. distance/orientation with respect to the target). We show the clustering of the state predictions when the reward function is based on the distance to the target (A-B) and when it is based on the orientation with respect to the target (C-D).
t-SNE visualization of the state predictions
Appendix A
Even when there are multiple target locations (two in this case), the state representation can encode this property.
PCA visualization (left) and t-SNE visualization (right) of the state predictions
Appendix B
The agent, using the state representation and policy learned in the simulation environment, can steer the real robot to the target locations.
On the left: trajectory samples of the real robot when the light is on (blue dots). On the right: trajectory samples when the light is first off (purple dots) and then on (blue dots).
The starting position is indicated by the green rectangle and the target location by the red circle.
Appendix C