1 of 9

Domain Generalization in RL:

A Case Study with Snake

Owen Dugan and Gopal Goel

2 of 9

Environment Setup

  • Classic snake game, played on a 40x40 grid.
  • Reward of +1 for eating food and a reward of -1 for death, so the agent effectively optimizes discounted length at the time of death.
  • We assume agents have limited visibility -- the agent controlling the snake sees only an 11x11 window centered on the head (a minimal environment sketch follows below).
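A minimal sketch of this setup, assuming a plain reset/step interface; the class name, cell encoding, and action mapping are illustrative, not the authors' implementation:

import numpy as np

class SnakeEnv:
    """Illustrative snake environment: 40x40 grid, +1 for food, -1 for death,
    11x11 observation centered on the head. A sketch, not the authors' code."""
    GRID, VIEW = 40, 11

    def __init__(self, start_length=3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.start_length = start_length
        self.reset()

    def reset(self):
        mid = self.GRID // 2
        self.snake = [(mid, mid - i) for i in range(self.start_length)]  # head first
        self._place_food()
        return self._observe()

    def _place_food(self):
        while True:
            cell = tuple(int(x) for x in self.rng.integers(0, self.GRID, size=2))
            if cell not in self.snake:
                self.food = cell
                return

    def _observe(self):
        # Egocentric 11x11 window; walls and body are -1, food is +1, empty is 0.
        half = self.VIEW // 2
        obs = np.zeros((self.VIEW, self.VIEW), dtype=np.float32)
        hr, hc = self.snake[0]
        for dr in range(-half, half + 1):
            for dc in range(-half, half + 1):
                r, c = hr + dr, hc + dc
                if not (0 <= r < self.GRID and 0 <= c < self.GRID) or (r, c) in self.snake:
                    obs[dr + half, dc + half] = -1.0
                elif (r, c) == self.food:
                    obs[dr + half, dc + half] = 1.0
        return obs

    def step(self, action):
        # Actions: 0=up, 1=down, 2=left, 3=right.
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        hr, hc = self.snake[0]
        head = (hr + dr, hc + dc)
        off_grid = not (0 <= head[0] < self.GRID and 0 <= head[1] < self.GRID)
        if off_grid or head in self.snake:
            return self._observe(), -1.0, True   # death: reward -1
        self.snake.insert(0, head)
        if head == self.food:
            self._place_food()                   # grow and respawn food
            return self._observe(), 1.0, False   # food: reward +1
        self.snake.pop()                         # move without growing
        return self._observe(), 0.0, False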

3 of 9

Training Methodology

  • PPO with actor and critic networks
    • Modified from the CleanRL library
  • Training data: eight snake runs per epoch (configuration sketched after this list)
    • Gamma = 0.999
    • Rollout steps = 1024
  • Between runs, we vary:
    • network structure
    • starting length
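A sketch of the corresponding PPO configuration in the style of CleanRL's argument dataclasses; values not stated on this slide (learning rate, clipping, GAE lambda, update epochs) are assumed defaults, not the authors' exact settings:

from dataclasses import dataclass

@dataclass
class PPOConfig:
    """PPO settings from this slide, plus assumed CleanRL-like defaults."""
    num_envs: int = 8                   # eight snake runs collected per epoch
    num_steps: int = 1024               # rollout length per environment
    gamma: float = 0.999                # discount factor
    gae_lambda: float = 0.95            # assumed GAE parameter
    learning_rate: float = 2.5e-4       # assumed
    clip_coef: float = 0.2              # assumed PPO clipping coefficient
    update_epochs: int = 4              # assumed
    total_timesteps: int = 14_400_000   # 1.44e7 steps per agent (next slide)

config = PPOConfig()
batch_size = config.num_envs * config.num_steps  # 8192 transitions per update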

4 of 9

Trained Agents

All agents were trained for 1.44e7 timesteps.
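For concreteness, one of the varied network structures (the CNN agent discussed later) might look like the actor-critic below; the layer sizes and hidden width are assumptions for illustration only, not the authors' architecture:

import torch
import torch.nn as nn

class CNNActorCritic(nn.Module):
    """Illustrative CNN actor-critic over the 11x11 observation window."""

    def __init__(self, n_actions=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 11 * 11, 128), nn.ReLU(),
        )
        self.actor = nn.Linear(128, n_actions)   # policy logits
        self.critic = nn.Linear(128, 1)          # state-value estimate

    def forward(self, obs):
        # obs: (batch, 1, 11, 11) egocentric view around the snake's head
        h = self.backbone(obs)
        return self.actor(h), self.critic(h)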

5 of 9

Evaluation Framework

  • Set up testing environments and let agents run
    • Estimate performance by the average length at the time of death (evaluation loop sketched after this list)
  • Test out-of-distribution scenarios to see whether the agent's learning generalizes
    • Maze environment with death squares in the interior of the grid
    • Multiplayer snake
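A sketch of the evaluation loop, assuming the reset/step interface and SnakeEnv fields from the earlier environment sketch; `policy` is any function mapping an observation to an action:

import numpy as np

def evaluate(env, policy, episodes=100, max_steps=5000):
    """Roll a trained policy in a test environment and report the average
    snake length at the time of death (illustrative, not the authors' code)."""
    lengths = []
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(max_steps):
            obs, _, done = env.step(policy(obs))
            if done:
                break
        lengths.append(len(env.snake))  # length at death (or at the step cap)
    return float(np.mean(lengths))

# The same loop runs unchanged on the maze or multiplayer variants,
# as long as they expose the same reset/step interface.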

6 of 9

Results

7 of 9

Discussion

  • Multi-agent test results are identical to those from the training-environment test
    • Most snakes die off quickly, and the last survivor is effectively playing in the training environment
    • We did not anticipate this behavior, but it makes sense in hindsight
  • Occasionally two long snakes survive and display interesting competitive dynamics

8 of 9

Discussion

  • The CNN agent is substantially worse than the other agents at the training task
  • However, the CNN agent performs much better on mazes!
    • All networks struggle on mazes, but the CNN less so
  • Explanation: the CNN's strategy is less sophisticated and so relies on more general skill
    • The other agents learn a complex zigzag pattern
    • The CNN does not, so it must be more robust to running into black squares

9 of 9

Acknowledgements

Thank you to Prof. Wu, Chanwoo Park, and Gilhyun Ryou for a great class!