RBE595 RL Research Paper Review

Keith Chester, Bob DeMont

Emergent tool use from multi-agent interaction

Worcester Polytechnic Institute

Hide and Seek through Reinforcement Learning

Background Challenges

    • Researchers aimed to create agents that can solve human-relevant tasks, especially tasks involving sensing and acting in physical worlds.
    • In supervised learning approaches, proper reward engineering requires time-consuming human input and fails to scale to increasingly complex tasks.
    • Alternatively, unsupervised learning approaches (such as intrinsic motivation) scale poorly and do not mirror natural learning.

Related Previous Work

  • Early work explored self-play using genetic algorithms
  • More recent work explored conditions for open-ended evolution and co-adaptation between agents and environments
  • Self-play with deep RL led to superhuman results in Backgammon, Go, Dota 2, and others
  • Finally, multi-agent teams trained in a 3D physics environment learned emergent behaviors, like passing in soccer simulations
  • Other work involved tool use, but the agents were explicitly incentivized to use tools

Key Concepts

  • Natural learning evolved through co-evolution and competition (natural selection)
  • Co-evolution and competition create autocurricula, in which competing agents continually create new tasks and challenges for each other
  • Previous work has been done in complex and physically grounded environments

What’s New

  • A mixed competitive/cooperative, physics-based environment in which agents play hide and seek
  • Agents create their own incentive for tool use through multi-agent competition
  • Main contributions of the work:
    • Emergent autocurricula in multi-agent self-play, with distinct strategy shifts
    • Grounded in a physical environment, autocurricula can lead to human-relevant skills like tool use
    • A proposed framework for evaluating agents in open-ended environments
    • An open-sourced environment and code to encourage further research

Setup

  • The game: Hide and Seek
    • Preparation period during which hiders are free to move while seekers are “locked” in place
  • The environment: movable objects that agents can grab or lock in place, plus randomly placed immovable walls; powered by the MuJoCo physics engine
  • Simple, sparse rewards (see the table and sketch below):
    • Team-based scoring
    • -10 penalty if an agent escapes the play area
    • No explicit rewards for interacting with objects
  • Episodes last 240 timesteps; the first 40% is the hiders’ preparation period

Per-step team rewards:

  Hiders:  +1 if no hider is seen;   -1 if any hider is seen
  Seekers: +1 if they find a hider;  -1 if they don’t find a hider
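A minimal sketch of this zero-sum team reward, assuming a hypothetical any_hider_seen flag computed by the environment (an illustration, not the paper’s released code):

    def hide_and_seek_reward(any_hider_seen, t, episode_len=240):
        """Per-step team reward; hiders and seekers are zero-sum.

        any_hider_seen: True if any seeker currently sees any hider.
        t: current timestep within the episode.
        """
        prep_steps = int(0.4 * episode_len)  # 96 of 240 steps
        if t < prep_steps:
            return 0.0, 0.0  # preparation phase: no reward for either team
        hider_reward = -1.0 if any_hider_seen else 1.0
        seeker_reward = -hider_reward
        return hider_reward, seeker_reward

The -10 penalty for leaving the play area is applied to the individual agent on top of this team reward.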

Agent Actions

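The original slide presents the action space as a figure. Per the paper, agents move by applying discretized forces and torque to themselves, and can grab or lock objects in front of them; a hypothetical gym-style encoding (the bin counts are assumptions, not the paper’s exact values):

    from gym import spaces

    # Hypothetical encoding of the per-agent action space described in
    # the paper; the number of discretization bins is an assumption.
    action_space = spaces.Dict({
        "force_x":  spaces.Discrete(11),  # discretized force along x
        "force_y":  spaces.Discrete(11),  # discretized force along y
        "torque_z": spaces.Discrete(11),  # discretized rotation torque
        "grab":     spaces.Discrete(2),   # grab / release an object in view
        "lock":     spaces.Discrete(2),   # lock / unlock an object in view
    })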

Policy

  • Optimized using Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE); a minimal GAE sketch follows below

  • Architecture: entity-centric observations processed with masked self-attention, followed by an LSTM (architecture figure omitted)

  • Trained via Rapid (OpenAI, 2018), a large-scale distributed RL framework
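A minimal sketch of GAE over a finished rollout, assuming plain reward and value arrays rather than the paper’s distributed Rapid pipeline (γ and λ here are common defaults, not necessarily the paper’s settings):

    import numpy as np

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation (Schulman et al., 2016).

        rewards: array of shape (T,)
        values:  array of shape (T+1,), including a bootstrap value
                 for the state after the final step.
        """
        T = len(rewards)
        advantages = np.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            # One-step TD error, then the exponentially weighted recursion.
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * lam * gae
            advantages[t] = gae
        return advantages  # value targets: advantages + values[:-1]

PPO then maximizes its clipped surrogate objective using these advantage estimates.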

Results

Timeline of emergent strategies (approximate training episodes at which each emerged):

  ~10M   Run away and chase
  ~25M   Hiders lock boxes (fort building)
  ~100M  Seekers use ramps
  ~110M  Hiders hide ramps
  ~380M  Seekers perform box surfing
  ~450M  Hiders lock boxes to prevent surfing

Emerging Strategies

Strategy 1: Run and Chase

Strategy 2: Tools to Hide

Strategy 3: Seeker Tool Use to Counter Hider Tool Use

Strategy 4: Hider Counter-Defense to Seeker Tool Use

Strategy 5: Seekers’ Counter to Hiders’ Counter

Strategy 6: Hiders’ Counter to Seekers’ Counter

Sensitivity Analysis

  • Batch size matters
    • Larger batch sizes lead to quicker training
    • No convergence at batch sizes of 8K or 16K

  • Results are fairly robust to environment randomization
    • Reduced randomization yields fewer emergent strategies

  • Adding further objects and objectives also leads to tool use

“providing further evidence that multi-agent interaction is a promising path towards self-supervised skill acquisition”

Evaluation

  • Proposed transfer to downstream tasks as a method of evaluation
  • Proposed 5 benchmark intelligence tests in 2 categories
    • Cognition and Memory
      • Object Counting
      • Lock and Return
      • Sequential Lock
    • Manipulation
      • Construction from Blueprint
      • Shelter Construction
  • Evaluated and compared three agents:
    • Pretrained hide and seek agent
    • Agent trained from scratch
    • Pretrained with count-based intrinsic motivation

Comparison

  • The hide-and-seek pretrained agent performed
    • slightly better than the others on Lock and Return, Sequential Lock, and Construction from Blueprint
    • slightly worse than the count-based agent on Object Counting
    • the same, but more slowly, on Shelter Construction

Conclusions

  • With simple rules and multi-agent competition, standard reinforcement learning agents learned complex strategies and skills
  • This setup, in a sufficiently complex environment, could lead to open-ended growth in complexity
  • Proof of concept that multi-agent autocurricula can lead to physically grounded, human-relevant behavior

Future Research

  • Reducing sample complexity
    • Better policy learning algorithms
    • Better policy architectures
  • Because agents were adept at exploiting inaccuracies in the simulated environment (e.g., box surfing), methods to reduce such exploitable flaws in the environment’s representation
