1 of 9

Domain Generalization in RL:

A Case Study with Snake

Owen Dugan and Gopal Goel

2 of 9

Environment Setup

  • Classic snake game, played on a 40x40 grid.
  • Reward of +1 for eating food and a reward of -1 for death, so the agent effectively optimizes discounted length at the time of death.
  • We assume agents have limited visibility -- the agent controlling the snake sees only an 11x11 window centered on the head (a minimal environment sketch follows below).
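A minimal sketch of this setup, assuming a plain reset/step interface; the class name, cell encoding, and action mapping are illustrative, not the authors' implementation:

import numpy as np

class SnakeEnv:
    """Illustrative snake environment: 40x40 grid, +1 for food, -1 for death,
    11x11 observation centered on the head. A sketch, not the authors' code."""
    GRID, VIEW = 40, 11

    def __init__(self, start_length=3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.start_length = start_length
        self.reset()

    def reset(self):
        mid = self.GRID // 2
        self.snake = [(mid, mid - i) for i in range(self.start_length)]  # head first
        self._place_food()
        return self._observe()

    def _place_food(self):
        while True:
            cell = tuple(int(x) for x in self.rng.integers(0, self.GRID, size=2))
            if cell not in self.snake:
                self.food = cell
                return

    def _observe(self):
        # Egocentric 11x11 window; walls and body are -1, food is +1, empty is 0.
        half = self.VIEW // 2
        obs = np.zeros((self.VIEW, self.VIEW), dtype=np.float32)
        hr, hc = self.snake[0]
        for dr in range(-half, half + 1):
            for dc in range(-half, half + 1):
                r, c = hr + dr, hc + dc
                if not (0 <= r < self.GRID and 0 <= c < self.GRID) or (r, c) in self.snake:
                    obs[dr + half, dc + half] = -1.0
                elif (r, c) == self.food:
                    obs[dr + half, dc + half] = 1.0
        return obs

    def step(self, action):
        # Actions: 0=up, 1=down, 2=left, 3=right.
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        hr, hc = self.snake[0]
        head = (hr + dr, hc + dc)
        off_grid = not (0 <= head[0] < self.GRID and 0 <= head[1] < self.GRID)
        if off_grid or head in self.snake:
            return self._observe(), -1.0, True   # death: reward -1
        self.snake.insert(0, head)
        if head == self.food:
            self._place_food()                   # grow and respawn food
            return self._observe(), 1.0, False   # food: reward +1
        self.snake.pop()                         # move without growing
        return self._observe(), 0.0, False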

3 of 9

Training Methodology

  • PPO with actor and critic networks
    • Modified from the CleanRL library
  • Training data: eight snake runs per epoch (configuration sketched after this list)
    • Gamma = 0.999
    • Rollout steps = 1024
  • Between runs, we vary:
    • network structure
    • starting length
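A sketch of the corresponding PPO configuration in the style of CleanRL's argument dataclasses; values not stated on this slide (learning rate, clipping, GAE lambda, update epochs) are assumed defaults, not the authors' exact settings:

from dataclasses import dataclass

@dataclass
class PPOConfig:
    """PPO settings from this slide, plus assumed CleanRL-like defaults."""
    num_envs: int = 8                   # eight snake runs collected per epoch
    num_steps: int = 1024               # rollout length per environment
    gamma: float = 0.999                # discount factor
    gae_lambda: float = 0.95            # assumed GAE parameter
    learning_rate: float = 2.5e-4       # assumed
    clip_coef: float = 0.2              # assumed PPO clipping coefficient
    update_epochs: int = 4              # assumed
    total_timesteps: int = 14_400_000   # 1.44e7 steps per agent (next slide)

config = PPOConfig()
batch_size = config.num_envs * config.num_steps  # 8192 transitions per update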

4 of 9

Trained Agents

All agents were trained for 1.44e7 timesteps.
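For concreteness, one of the varied network structures (the CNN agent discussed later) might look like the actor-critic below; the layer sizes and hidden width are assumptions for illustration only, not the authors' architecture:

import torch
import torch.nn as nn

class CNNActorCritic(nn.Module):
    """Illustrative CNN actor-critic over the 11x11 observation window."""

    def __init__(self, n_actions=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 11 * 11, 128), nn.ReLU(),
        )
        self.actor = nn.Linear(128, n_actions)   # policy logits
        self.critic = nn.Linear(128, 1)          # state-value estimate

    def forward(self, obs):
        # obs: (batch, 1, 11, 11) egocentric view around the snake's head
        h = self.backbone(obs)
        return self.actor(h), self.critic(h)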

5 of 9

Evaluation Framework

  • Set up testing environments and let agents run
    • Estimate performance by the average length at the time of death (evaluation loop sketched after this list)
  • Test out-of-distribution scenarios to see whether the agent's learning generalizes
    • Maze environment with death squares in the interior of the grid
    • Multiplayer snake
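A sketch of the evaluation loop, assuming the reset/step interface and SnakeEnv fields from the earlier environment sketch; `policy` is any function mapping an observation to an action:

import numpy as np

def evaluate(env, policy, episodes=100, max_steps=5000):
    """Roll a trained policy in a test environment and report the average
    snake length at the time of death (illustrative, not the authors' code)."""
    lengths = []
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(max_steps):
            obs, _, done = env.step(policy(obs))
            if done:
                break
        lengths.append(len(env.snake))  # length at death (or at the step cap)
    return float(np.mean(lengths))

# The same loop runs unchanged on the maze or multiplayer variants,
# as long as they expose the same reset/step interface.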

6 of 9

Results

7 of 9

Discussion

  • Multi-agent test results are identical to those from the training-environment test
    • Most snakes die off quickly, and the last survivor is effectively playing in the training environment
    • We did not anticipate this behavior, but it makes sense in hindsight
  • Occasionally two long snakes survive and display interesting competitive dynamics

8 of 9

Discussion

  • The CNN agent is substantially worse than the other agents at the training task
  • However, the CNN agent performs much better on mazes!
    • All networks struggle on mazes, but the CNN less so
  • Explanation: the CNN's strategy is less sophisticated and so relies on more general skill
    • The other agents learn a complex zigzag pattern
    • The CNN does not, so it must be more robust to running into black squares

9 of 9

Acknowledgements

Thank you to Prof. Wu, Chanwoo Park, and Gilhyun Ryou for a great class!