1 of 24

Paper review

  • GREIL-Crowds: Crowd Simulation with Deep Reinforcement Learning and Examples (SIGGRAPH 2023)

Yongwoo Lee

2 of 24

Previous work

  • Configurable Crowd Profiles (SIGGRAPH 2022)
    • Each agent exhibits multiple behaviors concurrently, controlled by an individual profile (a run-time variable)
      • Goal seeking, Collision avoidance, Grouping, Interaction

3 of 24

Previous work

  • Configurable Crowd Profiles (SIGGRAPH 2022)
    • Each agent exhibits multiple behaviors concurrently, controlled by an individual profile (a run-time variable)
      • Goal seeking, Collision avoidance, Grouping, Interaction
    • Plausible results, but different from real-world data

4 of 24

Previous work

  • Data-driven approaches in crowd simulation
    • The PAG Crowd: A Graph Based Approach for Efficient Data-Driven Crowd Simulation (CGF 2014)
      • The results depend heavily on the variability and amount of input data
        • With limited data, the simulated results can be inefficient
      • Cannot preserve long-term consistency (history and future consequences)
        • Inconsistencies in the overall trajectories
        • Errors accumulate over time
      • Results still differ from the original data.
        • How can we integrate simulated results with real-world data?

5 of 24

Introduction

  • In summary:
    • RL-based approaches: manual reward design makes it hard to capture real-world human behavior
    • Data-driven approaches: no long-term consistency, not efficient
  • GREIL-Crowds aims to address both limitations

6 of 24

Introduction

  • Framework : GREIL-Crowds
    • An RL framework that can reproduce plausible and consistent crowd behaviors from a relatively small set of input crowd data
    • An implicit reward formulation based on data-driven novelty detection
    • Q-learning with a discrete action space is employed to find policies for crowd agents
      • reference : Learning to Schedule Control Fragments for Physics-Based Characters Using Deep Q-Learning(SIGGRAPH 2017)
      • [video] [video2]
      • Then, what are the state, action, and reward?

7 of 24

Methods

  • State: defined in the local coordinate system of the agent (aligned with its facing direction)
    • Observations of the environment
      • Positions and velocities of the agent and its neighbors
      • Agent’s goal position
    • 3 kinds of state components
      • s_a = magnitude of the current velocity.
      • s_g = relative velocity compared to desired velocity
      • s_n = closest-neighbor information in predefined arcs (distance, relative velocity); see the sketch below
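
Below is a minimal sketch of how such a state vector could be assembled, assuming the local frame is aligned with the agent's facing direction. The number of arcs, the sensing radius, and the exact per-arc encoding are illustrative assumptions, not values from the paper.

```python
import numpy as np

NUM_ARCS = 8        # assumed number of predefined arcs around the agent
SENSE_RADIUS = 5.0  # assumed sensing radius in meters

def build_state(agent_pos, agent_vel, goal_pos, desired_speed, neighbor_pos, neighbor_vel):
    """Assemble (s_a, s_g, s_n) in the agent's local frame (x-axis = facing direction)."""
    agent_pos = np.asarray(agent_pos, dtype=float)
    agent_vel = np.asarray(agent_vel, dtype=float)

    heading = np.arctan2(agent_vel[1], agent_vel[0])
    c, s = np.cos(-heading), np.sin(-heading)
    rot = np.array([[c, -s], [s, c]])                  # world -> local rotation

    def to_local(vec):
        return rot @ np.asarray(vec, dtype=float)

    # s_a: magnitude of the current velocity
    s_a = np.array([np.linalg.norm(agent_vel)])

    # s_g: current velocity relative to the desired velocity toward the goal
    to_goal = np.asarray(goal_pos, dtype=float) - agent_pos
    desired_vel = desired_speed * to_goal / (np.linalg.norm(to_goal) + 1e-8)
    s_g = to_local(desired_vel - agent_vel)

    # s_n: closest neighbor per predefined arc -> (distance, relative velocity)
    s_n = np.tile([SENSE_RADIUS, 0.0, 0.0], NUM_ARCS).reshape(NUM_ARCS, 3)
    for p, v in zip(neighbor_pos, neighbor_vel):
        rel_p = to_local(np.asarray(p, dtype=float) - agent_pos)
        dist = np.linalg.norm(rel_p)
        if dist > SENSE_RADIUS:
            continue
        arc = int((np.arctan2(rel_p[1], rel_p[0]) + np.pi) / (2 * np.pi) * NUM_ARCS) % NUM_ARCS
        if dist < s_n[arc, 0]:                         # keep only the closest neighbor per arc
            s_n[arc] = [dist, *to_local(np.asarray(v, dtype=float) - agent_vel)]

    return np.concatenate([s_a, s_g, s_n.ravel()])
```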

8 of 24

Methods

  • Action {a_i} ∈ ℝ²
    • Accelerations over 0.5 s windows of the data (initial state = (0, 0))
    • Discretized values (like control fragments)
    • Kernel Density Estimation (KDE) on a 100 x 100 grid
      • Sample 50 values -> the discrete action set (see the sketch below)
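
A hedged sketch of how the discrete action set could be built from the data: fit a KDE over 2D accelerations from 0.5 s windows, evaluate it on a 100 x 100 grid, and pick 50 actions. The grid size and sample count follow the slide; the bandwidth, the grid extent, the density-weighted sampling, and the helper name build_action_set are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def build_action_set(accelerations, grid_size=100, num_actions=50, bandwidth=0.1, seed=0):
    """Fit a KDE over 2D accelerations (N, 2) extracted from 0.5 s windows of the data,
    evaluate it on a grid_size x grid_size grid, and sample num_actions grid cells
    (weighted by density) as the discrete action set."""
    acc = np.asarray(accelerations, dtype=float)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(acc)

    # grid covering the range of observed accelerations
    xs = np.linspace(acc[:, 0].min(), acc[:, 0].max(), grid_size)
    ys = np.linspace(acc[:, 1].min(), acc[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])     # (grid_size^2, 2)

    density = np.exp(kde.score_samples(grid))            # KDE evaluated on the grid
    probs = density / density.sum()

    rng = np.random.default_rng(seed)
    idx = rng.choice(len(grid), size=num_actions, replace=False, p=probs)
    return grid[idx]                                      # (num_actions, 2) accelerations

# Usage: actions = build_action_set(acc_windows)
# The zero acceleration could be appended explicitly if needed, since the slide
# mentions an initial state of (0, 0).
```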

9 of 24

Methods

  • Data-driven Reward function
    • Previous work: RL with manually defined task rewards
      • Goal seeking
      • Collision avoidance
      • Grouping
      • Interaction
    • However, real crowds show behaviors that such rewards do not capture
      • Standing still
      • Changing direction without any obvious reason

    • Instead of a manually defined reward function, novelty detection is used: an implicit reward representation derived from the crowd data

10 of 24

Methods

  • Novelty detection
    • := how common (i.e., how much of an inlier) the agent's state is
    • Using k-Nearest Neighbors, each state s is assigned a score R(s), defined as the distance to the farthest of its K nearest neighbors among the data states (the K-th nearest neighbor):
      R(s) = max_{s_i ∈ N_K(s)} ‖s − s_i‖

11 of 24

Methods

  • Novelty detection
    • := how common (i.e., how much of an inlier) the agent's state is
    • Using k-Nearest Neighbors, each state s is assigned a score R(s), defined as the distance to the farthest of its K nearest neighbors among the data states (the K-th nearest neighbor):
      R(s) = max_{s_i ∈ N_K(s)} ‖s − s_i‖
    • The anomaly (outlier) score p_K(s) is the fraction of data states that are less novel, i.e., whose own score does not exceed R(s):
      p_K(s) = (1/N) Σ_{i=1..N} 1[ R(s_i) ≤ R(s) ]
    • The inlier score is the complement, 1 − p_K(s)  (see the code sketch below)
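
A minimal sketch of these scores, assuming Euclidean distance in the state space; the class name NoveltyScorer, the choice K = 10, and the leave-one-out computation of the data scores are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class NoveltyScorer:
    """k-NN novelty scores over a set of data states:
    R(s)   = distance to the K-th nearest data state,
    p_K(s) = fraction of data states whose own score R(s_i) is <= R(s) (outlier score),
    inlier score = 1 - p_K(s)."""

    def __init__(self, data_states, k=10):
        data_states = np.asarray(data_states, dtype=float)
        self.nn = NearestNeighbors(n_neighbors=k).fit(data_states)
        # Leave-one-out scores R(s_i) of the data states themselves:
        # query k+1 neighbors and drop the first (the point itself, at distance 0).
        dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(data_states).kneighbors(data_states)
        self.data_scores = np.sort(dists[:, -1])

    def score(self, s):
        """R(s): distance to the farthest of the K nearest data states."""
        dists, _ = self.nn.kneighbors(np.atleast_2d(s))
        return float(dists[0, -1])

    def outlier_score(self, s):
        """p_K(s): fraction of data states with R(s_i) <= R(s)."""
        rank = np.searchsorted(self.data_scores, self.score(s), side="right")
        return rank / len(self.data_scores)

    def inlier_score(self, s):
        """1 - p_K(s): high for states that look like the data."""
        return 1.0 - self.outlier_score(s)
```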

12 of 24

Methods

  • Novelty detection
    • People sometimes prefer common situations, and at other times they prefer rare ones.

How does the reward function R satisfy both cases from the data?

      • Localized p-value estimation (k-LPE) [1] is used
      • The reward is based on "how much of the data has a lower inlier score than the state", i.e., the fraction of data states that are more novel than the state
  • So, the reward function is
      r(s, a, s′) = 1 − p_K(s′)
    • When evaluating a transition (s, a, s′), only the next state s′ is considered
    • Intuitively, this choice helps agents to visit
      • both rare and common states proportionally to the probability that they were generated by the source state distribution (see the sketch below)
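
To connect this reward to the Q-learning setup mentioned earlier, here is a hedged sketch of a standard one-step DQN-style target that uses it. The discount factor and the NoveltyScorer helper from the previous sketch are assumptions, not values from the paper.

```python
import numpy as np

GAMMA = 0.95  # assumed discount factor (not specified on the slide)

def q_target(scorer, s_next, q_values_next):
    """One-step Q-learning target with the data-driven reward r = 1 - p_K(s')."""
    r = scorer.inlier_score(s_next)            # reward depends only on the next state s'
    return r + GAMMA * np.max(q_values_next)   # standard bootstrapped DQN-style target
```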

13 of 24

Methods

  • Overview

14 of 24

Methods

  • Training Q-Network
    • Input: the state
    • Output: a vector of Q(s, a) values for all a ∈ A
    • An MLP with 4 hidden layers [128, 128, 64, 64]
  • Training strategy: more guided than purely random rollouts (see the sketch after this list)
    • Playback agents
      • Select random segments of reference data to initialize the agent
      • Then let them follow their original (recorded) trajectories
    • Simulated agents
      • Take actions from the currently learned policy
      • Take completely random actions
      • Take actions sampled from statistics of action sequences in the reference data
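
A hedged sketch of the Q-network above and of how a simulated agent could mix the three action sources during training. The layer sizes follow the slide; the mixing probabilities, the simplification of the data-driven action statistics to a stationary prior over actions, and the PyTorch implementation are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP with hidden layers [128, 128, 64, 64]; outputs Q(s, a) for every discrete action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)                      # shape (batch, num_actions)

def select_action(q_net, state, action_prior, p_random=0.1, p_prior=0.2):
    """Simulated agents mix three action sources during training: the current policy,
    completely random actions, and actions drawn from statistics of the reference data
    (here simplified to a stationary probability vector action_prior)."""
    u = np.random.rand()
    num_actions = len(action_prior)
    if u < p_random:
        return np.random.randint(num_actions)                 # completely random
    if u < p_random + p_prior:
        return int(np.random.choice(num_actions, p=action_prior))  # data-driven prior
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q.argmax(dim=1).item())                        # greedy w.r.t. current policy
```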

15 of 24

Experiment & Evaluation

  • Evaluation
    • Please refer to [supplementary video].

16 of 24

Discussion

  • Future work
    • Learning Strategy
      • When the data can ensure sufficient quality,
        • combine other RL improvements to increase performance
    • Continuous Actions
      • Actor-critic RL methods could be used.
    • Reward Function
      • k-NN search becomes slow in high dimensions
      • It does not scale well with large amounts of data
      • Alternative: learn the reward function concurrently with policy learning
    • Heterogeneity
      • Additional control signals could be used.

17 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data

[Figure: the k nearest neighbors in the state space of the input data]

18 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data
    • For a test point s_i, the distance score R(s_i) is the distance to the farthest of its k nearest neighbors (the k-th nearest neighbor)

[Figure: a test point, its k nearest neighbors, and the distance score R(s_i)]

20 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data
    • Inlier score := how common the test point is, compared to the data points: the fraction of data points with a larger distance score, i.e., 1 − p_K

[Figure: the test point's distance score compared against those of the data points]

22 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data
    • Reward function: r = 1 − p_K(s′), i.e., one minus the outlier score p_K

[Figure: the outlier score p_K mapped to the reward 1 − p_K]

23 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data
    • Reward function: r = 1 − p_K(s′), i.e., one minus the outlier score p_K
    • Outlier (rare) state => high p_K => low reward
    • Inlier (common) state => low p_K => high reward

[Figure: rare (outlier) states receive low reward, common (inlier) states receive high reward]

24 of 24

Methods