1 of 18

RBE595 RL Research Paper Review

Keith Chester, Bob DeMont

Emergent tool use from multi-agent interaction

Worcester Polytechnic Institute

2 of 18

Hide and Seek through Reinforcement Learning

3 of 18

Background Challenges

Researchers aimed to create agents that could solve human-relevant tasks, especially tasks involving sensing and acting in physical worlds.
In directly supervised approaches, where humans specify the task and its reward, proper reward engineering requires time-consuming human input and fails to scale to increasingly complex tasks.
Alternatively, unsupervised approaches (such as intrinsic motivation) scale poorly and don’t mirror natural learning.

4 of 18

Related Previous Work

Early work explored self play using genetic algorithms
More recent work explored conditions for open-ended evolution and co-adaptation between agents and environments
Self play with deep RL led to superior results in Backgammon, Go, Dota and others
Finally, multiple agents trained in a 3D physics environment learned emergent behaviors like passing in soccer simulations
Other work involved tool use, but the agents were explicitly incentivized to use the tools

The paper specifically calls out previous related work that combined genetic algorithms with self-play. In self-play, agents play against themselves or against other agents that are being trained at the same time; no outside input is necessarily required.

Recent work also looked into open-ended evolution and co-adaptation between agents and environments.

The paper also mentions the well-known successes of self-play with deep RL in Backgammon (TD-Gammon), Go (AlphaGo and its successors), Dota 2 (OpenAI Five), and other games, in several cases matching or beating elite human players.

Most recently, agents trained in multi-agent scenarios in simulated 3D physics environments learned emergent behaviors such as passing in soccer simulations.

Other work involved tool use, but the agents were incentivized to use the tools; they didn’t discover them organically.

5 of 18

Key Concepts

Natural learning evolved through coevolution and competition (natural selection)
Coevolution and competition create autocurricula, where competing agents constantly create new tasks and challenges for each other
Previous work has been done in complex and physically grounded environments

6 of 18

What’s New

A mixed competitive/cooperative physics-based environment in which agents play hide and seek.
Agents create their own incentive for tool use through multi-agent competition
Main contributions of the work:

Emergent autocurricula in multi-agent self play with strategy shifts
Grounded in a physical environment, autocurricula can lead to human-relevant skills like tool use
Proposal of a framework for evaluating agents in open-ended environments
Open sourced environment and code to encourage research

7 of 18

Setup

The game: Hide and Seek

Preparation period in which hiders are free to move and seekers are “locked” in place

The environment: movable objects that agents can grab or lock in place, plus randomly placed immovable walls. Powered by the MuJoCo physics engine
Simple sparse rewards:

Team based scoring
-10 penalty if an agent escapes the play area

no explicit rewards for interacting with objects

240 timesteps: 40% for hider preparation

Hiders	Seekers
+1 if none are seen	+1 if they find a hider
-1 if seen	-1 if they don’t find a hider
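The scoring rules on this slide can be sketched as a small per-timestep reward function. This is a minimal illustration, not the authors' code: the function name `team_rewards` and its arguments are hypothetical, and two details are assumptions not stated on the slide: that the team reward is zero during the preparation period, and that the -10 out-of-bounds penalty is added per agent on top of the team reward.

```python
EPISODE_STEPS = 240            # from the slide
PREP_FRACTION = 0.4            # from the slide: 40% of the episode is hider preparation
OUT_OF_BOUNDS_PENALTY = -10.0  # from the slide

def team_rewards(step, any_hider_seen, hiders_out, seekers_out):
    """Return (hider_rewards, seeker_rewards) for one timestep.

    hiders_out / seekers_out are lists of booleans, one per agent,
    True when that agent has left the play area (hypothetical inputs).
    """
    prep_steps = round(EPISODE_STEPS * PREP_FRACTION)  # 96 steps
    if step < prep_steps:
        # Assumption: no team reward while seekers are still locked.
        hider_team = 0.0
        seeker_team = 0.0
    else:
        # Team-based scoring: hiders win the step only if none are seen.
        hider_team = -1.0 if any_hider_seen else 1.0
        seeker_team = -hider_team
    # Assumption: the penalty is applied per agent, added to the team reward.
    hiders = [hider_team + (OUT_OF_BOUNDS_PENALTY if out else 0.0)
              for out in hiders_out]
    seekers = [seeker_team + (OUT_OF_BOUNDS_PENALTY if out else 0.0)
               for out in seekers_out]
    return hiders, seekers
```

Note that nothing in this function mentions boxes or ramps: any tool use has to emerge because it helps a team avoid being seen or helps it find a hider.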

8 of 18

Agent Actions

9 of 18

Policy

Optimized using Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE)
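For readers unfamiliar with the two techniques named above, the sketch below shows the standard textbook forms of the GAE recursion and the PPO clipped surrogate for a single sample. It is not taken from the paper's code; the function names and the default values of `gamma`, `lam`, and `clip_eps` are illustrative assumptions only.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] and values[t] are per-step rewards and value estimates;
    last_value is the value estimate for the state after the final step.
    gamma and lam defaults are placeholders, not the paper's settings.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); clip_eps is a placeholder value.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With `lam=1` and a zero value function the advantages reduce to discounted returns-to-go, while `lam=0` gives the one-step TD error; GAE interpolates between the two to trade bias against variance.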

Architecture

Training via Rapid (OpenAI 2018), a large-scale distributed RL framework

10 of 18

Results

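Emergent strategy phases, in order:

Run away and chase
Hiders lock boxes
Seekers use ramps
Hiders hide ramps
Seekers perform block surfing
Hiders lock blocks to prevent surfing

Timeline markers from the figure (pairing with the phases is not recoverable from the slide): 10MM, 25MM, 100MM, 110MM, 380MM, 450MM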

11 of 18

Emerging Strategies

Strategy 1: Run and Chase

Strategy 2: Tools to Hide

12 of 18

Emerging Strategies

Strategy 3: Seeker Tool Use to Counter Hider Tool Use

Strategy 4: Hider Counter Defense to Seeker Tool Use

13 of 18

Emerging Strategies

Strategy 5: Seekers’ Counter to Hiders’ Counter

Strategy 6: Hiders’ Counter to Seekers’ Counter

14 of 18

Sensitivity Analysis

Batch size matters

Larger batches lead to quicker training
No convergence at batch sizes of 8k/16k

Fairly robust results with randomized environments

Fewer emergent strategies with reduced randomization

Using additional objects and objectives also leads to tool use

“providing further evidence that multi-agent interaction is a promising path towards self-supervised skill acquisition”

15 of 18

Evaluation

Proposed transfer performance as the method of evaluation
Proposed 5 benchmark intelligence tests in 2 categories

Cognition and Memory

Object counting
Lock and return
Sequential Lock

Manipulation

Construction from blueprint
Shelter construction

Evaluated and compared

Pretrained hide and seek agent
Agent trained from scratch
Pretrained with count-based intrinsic motivation

To evaluate the learning, the authors proposed 5 tests and compared 3 differently trained agents on them.

The benchmarks covered the cognition and memory tasks of object counting, lock and return, and sequential lock, as well as the manipulation tasks of construction from blueprint and shelter construction.

In object counting, the agent is pinned to a location while boxes move randomly in its sight before going out of view; the agent is then asked to predict the total number of boxes.

In lock and return, the agent is measured on locking a box in a random location and returning to origin.

In sequential lock, the agent must lock 4 boxes in a particular order, which it has to discover during the task.

In construction from blueprint, the agent must place boxes according to a plan.

And in shelter construction, the agent, given blocks, must build a shelter around a cylinder.

The agents tested were the pretrained hide and seek agent, an agent trained from scratch, and an agent pretrained with count-based intrinsic motivation (where agents are explicitly rewarded for exploring more).
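To make the count-based baseline more concrete, the sketch below shows one common form of count-based intrinsic motivation: the agent receives a bonus that shrinks as it revisits the same discretized state. This is an illustrative assumption rather than the paper's exact formulation; the class name `CountBonus`, the grid discretization over position, the `beta / sqrt(count)` form, and the default parameter values are all hypothetical choices.

```python
import math
from collections import defaultdict

class CountBonus:
    """Exploration bonus that decays with visits to a discretized state."""

    def __init__(self, beta=0.1, cell_size=1.0):
        self.beta = beta            # bonus scale (placeholder value)
        self.cell_size = cell_size  # grid resolution (placeholder value)
        self.counts = defaultdict(int)

    def _cell(self, position):
        # Assumption: the state is summarized by a position,
        # bucketed into grid cells of side cell_size.
        return tuple(math.floor(c / self.cell_size) for c in position)

    def bonus(self, position):
        """Record a visit and return the intrinsic reward beta / sqrt(count)."""
        cell = self._cell(position)
        self.counts[cell] += 1
        return self.beta / math.sqrt(self.counts[cell])
```

A first visit to a cell earns the full `beta`, and repeat visits earn progressively less, so the agent is pushed toward states it has rarely seen rather than toward any particular skill.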

16 of 18

Comparison

The hide and seek agent performed:

Slightly better than the others on lock and return, sequential lock, and construction from blueprint
Slightly worse than count-based on object counting
Same score but slower on shelter construction

17 of 18

Conclusions

With simple rules and multi-agent competition, standard reinforcement learning agents learned complex strategies and skills
In a sufficiently complex environment, this setup could lead to open-ended growth in complexity
Proof of concept that multi-agent autocurricula can lead to physically grounded, human-relevant behavior

18 of 18

Future Research

Reducing sample complexity

Better policy learning algorithms
Better policy architectures

Reducing exploitable inaccuracies in the environment: agents proved adept at exploiting flaws in the simulation (e.g., box surfing), so methods are needed to make the environment representation more accurate