1 of 24

Paper review

  • GREIL-Crowds: Crowd Simulation with Deep Reinforcement Learning and Examples (SIGGRAPH 2023)

Yongwoo Lee

2 of 24

Previous work

  • Configurable Crowd Profiles (SIGGRAPH 2022)
    • Each agent exhibits multiple behaviors concurrently, controlled by an individual profile (a run-time variable)
      • Goal seeking, Collision avoidance, Grouping, Interaction

3 of 24

Previous work

  • Configurable Crowd Profiles (SIGGRAPH 2022)
    • Each agent exhibits multiple behaviors concurrently, controlled by an individual profile (a run-time variable)
      • Goal seeking, Collision avoidance, Grouping, Interaction
    • Plausible results, but different from real-world data

4 of 24

Previous work

  • Data-driven approaches in crowd simulation
    • The PAG Crowd: A Graph Based Approach for Efficient Data-Driven Crowd Simulation (CGF 2014)
      • The results depend heavily on the variability and amount of input data
        • With limited data, the simulated results can be inefficient
      • Cannot preserve long-term consistency (history and future consequences)
        • Inconsistencies in the overall trajectories
        • Errors accumulate over time
      • Results still differ from the original data.
        • How can we integrate simulated results with real-world data?

5 of 24

Introduction

  • In summary:
    • RL-based approaches: manual reward design makes it hard to capture real-world human behavior
    • Data-driven approaches: no long-term consistency, not efficient
  • GREIL-Crowds aims to address both limitations

6 of 24

Introduction

  • Framework : GREIL-Crowds
    • An RL framework that can reproduce plausible and consistent crowd behaviors from a relatively small set of input crowd data
    • An implicit reward formulation based on data-driven novelty detection
    • Q-learning with a discrete action space is employed to find policies for crowd agents
      • reference : Learning to Schedule Control Fragments for Physics-Based Characters Using Deep Q-Learning(SIGGRAPH 2017)
      • [video] [video2]
      • Then, what are the state, action, and reward?

7 of 24

Methods

  • State: defined in the local coordinate system of the agent (aligned with its facing direction)
    • Observations of the environment
      • Positions and velocities of the agent and its neighbors
      • Agent’s goal position
    • 3 kinds of state components
      • s_a = magnitude of the current velocity.
      • s_g = relative velocity compared to desired velocity
      • s_n = closest-neighbor information in predefined arcs (distance, relative velocity); see the sketch below
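
Below is a minimal sketch of how such a state vector could be assembled, assuming the local frame is aligned with the agent's facing direction. The number of arcs, the sensing radius, and the exact per-arc encoding are illustrative assumptions, not values from the paper.

```python
import numpy as np

NUM_ARCS = 8        # assumed number of predefined arcs around the agent
SENSE_RADIUS = 5.0  # assumed sensing radius in meters

def build_state(agent_pos, agent_vel, goal_pos, desired_speed, neighbor_pos, neighbor_vel):
    """Assemble (s_a, s_g, s_n) in the agent's local frame (x-axis = facing direction)."""
    agent_pos = np.asarray(agent_pos, dtype=float)
    agent_vel = np.asarray(agent_vel, dtype=float)

    heading = np.arctan2(agent_vel[1], agent_vel[0])
    c, s = np.cos(-heading), np.sin(-heading)
    rot = np.array([[c, -s], [s, c]])                  # world -> local rotation

    def to_local(vec):
        return rot @ np.asarray(vec, dtype=float)

    # s_a: magnitude of the current velocity
    s_a = np.array([np.linalg.norm(agent_vel)])

    # s_g: current velocity relative to the desired velocity toward the goal
    to_goal = np.asarray(goal_pos, dtype=float) - agent_pos
    desired_vel = desired_speed * to_goal / (np.linalg.norm(to_goal) + 1e-8)
    s_g = to_local(desired_vel - agent_vel)

    # s_n: closest neighbor per predefined arc -> (distance, relative velocity)
    s_n = np.tile([SENSE_RADIUS, 0.0, 0.0], NUM_ARCS).reshape(NUM_ARCS, 3)
    for p, v in zip(neighbor_pos, neighbor_vel):
        rel_p = to_local(np.asarray(p, dtype=float) - agent_pos)
        dist = np.linalg.norm(rel_p)
        if dist > SENSE_RADIUS:
            continue
        arc = int((np.arctan2(rel_p[1], rel_p[0]) + np.pi) / (2 * np.pi) * NUM_ARCS) % NUM_ARCS
        if dist < s_n[arc, 0]:                         # keep only the closest neighbor per arc
            s_n[arc] = [dist, *to_local(np.asarray(v, dtype=float) - agent_vel)]

    return np.concatenate([s_a, s_g, s_n.ravel()])
```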

8 of 24

Methods

  • Action {a_i} ∈ ℝ²
    • Accelerations over 0.5 s windows of the data (initial state = (0, 0))
    • Discretized values (like control fragments)
    • Kernel Density Estimation (KDE) on a 100 x 100 grid
      • Sample 50 values -> the discrete action set (see the sketch below)
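
A hedged sketch of how the discrete action set could be built from the data: fit a KDE over 2D accelerations from 0.5 s windows, evaluate it on a 100 x 100 grid, and pick 50 actions. The grid size and sample count follow the slide; the bandwidth, the grid extent, the density-weighted sampling, and the helper name build_action_set are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def build_action_set(accelerations, grid_size=100, num_actions=50, bandwidth=0.1, seed=0):
    """Fit a KDE over 2D accelerations (N, 2) extracted from 0.5 s windows of the data,
    evaluate it on a grid_size x grid_size grid, and sample num_actions grid cells
    (weighted by density) as the discrete action set."""
    acc = np.asarray(accelerations, dtype=float)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(acc)

    # grid covering the range of observed accelerations
    xs = np.linspace(acc[:, 0].min(), acc[:, 0].max(), grid_size)
    ys = np.linspace(acc[:, 1].min(), acc[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])     # (grid_size^2, 2)

    density = np.exp(kde.score_samples(grid))            # KDE evaluated on the grid
    probs = density / density.sum()

    rng = np.random.default_rng(seed)
    idx = rng.choice(len(grid), size=num_actions, replace=False, p=probs)
    return grid[idx]                                      # (num_actions, 2) accelerations

# Usage: actions = build_action_set(acc_windows)
# The zero acceleration could be appended explicitly if needed, since the slide
# mentions an initial state of (0, 0).
```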

9 of 24

Methods

  • Data-driven Reward function
    • Previous work: RL with manually defined task rewards
      • Goal seeking
      • Collision avoidance
      • Grouping
      • Interaction
    • However, real crowds show behaviors that such rewards do not capture
      • Standing still
      • Changing direction without any obvious reason

    • Instead of a manually defined reward function, novelty detection is used: an implicit reward representation derived from the crowd data

10 of 24

Methods

  • Novelty detection
    • := how common (i.e., how much of an inlier) the agent's state is
    • Using k-Nearest Neighbors, each state s is assigned a score R(s), defined as the distance to the farthest of its K nearest neighbors among the data states (the K-th nearest neighbor):
      R(s) = max_{s_i ∈ N_K(s)} ‖s − s_i‖

11 of 24

Methods

  • Novelty detection
    • := how common (i.e., how much of an inlier) the agent's state is
    • Using k-Nearest Neighbors, each state s is assigned a score R(s), defined as the distance to the farthest of its K nearest neighbors among the data states (the K-th nearest neighbor):
      R(s) = max_{s_i ∈ N_K(s)} ‖s − s_i‖
    • The anomaly (outlier) score p_K(s) is the fraction of data states that are less novel, i.e., whose own score does not exceed R(s):
      p_K(s) = (1/N) Σ_{i=1..N} 1[ R(s_i) ≤ R(s) ]
    • The inlier score is the complement, 1 − p_K(s)  (see the code sketch below)
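
A minimal sketch of these scores, assuming Euclidean distance in the state space; the class name NoveltyScorer, the choice K = 10, and the leave-one-out computation of the data scores are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class NoveltyScorer:
    """k-NN novelty scores over a set of data states:
    R(s)   = distance to the K-th nearest data state,
    p_K(s) = fraction of data states whose own score R(s_i) is <= R(s) (outlier score),
    inlier score = 1 - p_K(s)."""

    def __init__(self, data_states, k=10):
        data_states = np.asarray(data_states, dtype=float)
        self.nn = NearestNeighbors(n_neighbors=k).fit(data_states)
        # Leave-one-out scores R(s_i) of the data states themselves:
        # query k+1 neighbors and drop the first (the point itself, at distance 0).
        dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(data_states).kneighbors(data_states)
        self.data_scores = np.sort(dists[:, -1])

    def score(self, s):
        """R(s): distance to the farthest of the K nearest data states."""
        dists, _ = self.nn.kneighbors(np.atleast_2d(s))
        return float(dists[0, -1])

    def outlier_score(self, s):
        """p_K(s): fraction of data states with R(s_i) <= R(s)."""
        rank = np.searchsorted(self.data_scores, self.score(s), side="right")
        return rank / len(self.data_scores)

    def inlier_score(self, s):
        """1 - p_K(s): high for states that look like the data."""
        return 1.0 - self.outlier_score(s)
```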

12 of 24

Methods

  • Novelty detection
    • People sometimes prefer common situations, and at other times they prefer rare ones.

How does the reward function R satisfy both cases from the data?

      • Localized p-value estimation (k-LPE) [1] is used
      • The reward is based on "how much of the data has a lower inlier score than the state", i.e., the fraction of data states that are more novel than the state
  • So, the reward function is
      r(s, a, s′) = 1 − p_K(s′)
    • When evaluating a transition (s, a, s′), only the next state s′ is considered
    • Intuitively, this choice helps agents to visit
      • both rare and common states proportionally to the probability that they were generated by the source state distribution (see the sketch below)
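
To connect this reward to the Q-learning setup mentioned earlier, here is a hedged sketch of a standard one-step DQN-style target that uses it. The discount factor and the NoveltyScorer helper from the previous sketch are assumptions, not values from the paper.

```python
import numpy as np

GAMMA = 0.95  # assumed discount factor (not specified on the slide)

def q_target(scorer, s_next, q_values_next):
    """One-step Q-learning target with the data-driven reward r = 1 - p_K(s')."""
    r = scorer.inlier_score(s_next)            # reward depends only on the next state s'
    return r + GAMMA * np.max(q_values_next)   # standard bootstrapped DQN-style target
```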

13 of 24

Methods

  • Overview

14 of 24

Methods

  • Training Q-Network
    • Input: the state
    • Output: a vector of Q(s, a) values for all a ∈ A
    • An MLP with 4 hidden layers [128, 128, 64, 64]
  • Training strategy: more guided than purely random rollouts (see the sketch after this list)
    • Playback agents
      • Select random segments of reference data to initialize the agent
      • Then let them follow their original (recorded) trajectories
    • Simulated agents
      • Take actions from the currently learned policy
      • Take completely random actions
      • Take actions sampled from statistics of action sequences in the reference data
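
A hedged sketch of the Q-network above and of how a simulated agent could mix the three action sources during training. The layer sizes follow the slide; the mixing probabilities, the simplification of the data-driven action statistics to a stationary prior over actions, and the PyTorch implementation are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP with hidden layers [128, 128, 64, 64]; outputs Q(s, a) for every discrete action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)                      # shape (batch, num_actions)

def select_action(q_net, state, action_prior, p_random=0.1, p_prior=0.2):
    """Simulated agents mix three action sources during training: the current policy,
    completely random actions, and actions drawn from statistics of the reference data
    (here simplified to a stationary probability vector action_prior)."""
    u = np.random.rand()
    num_actions = len(action_prior)
    if u < p_random:
        return np.random.randint(num_actions)                 # completely random
    if u < p_random + p_prior:
        return int(np.random.choice(num_actions, p=action_prior))  # data-driven prior
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q.argmax(dim=1).item())                        # greedy w.r.t. current policy
```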

15 of 24

Experiment & Evaluation

  • Evaluation
    • Please refer to [supplementary video].

16 of 24

Discussion

  • Future work
    • Learning Strategy
      • When the data can ensure sufficient quality,
        • combine other RL improvements to increase performance
    • Continuous Actions
      • Actor-critic RL methods could be used.
    • Reward Function
      • k-NN search becomes slow in high dimensions
      • It does not scale well with large amounts of data
      • Alternative: learn the reward function concurrently with policy learning
    • Heterogeneity
      • Additional control signals could be used.

17 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data

[Figure: the k nearest neighbors in the state space of the input data]

18 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data
    • For a test point s_i, the distance score R(s_i) is the distance to the farthest of its k nearest neighbors (the k-th nearest neighbor)

[Figure: a test point, its k nearest neighbors, and the distance score R(s_i)]

20 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data
    • Inlier score := how common the test point is, compared to the data points: the fraction of data points with a larger distance score, i.e., 1 − p_K

[Figure: the test point's distance score compared against those of the data points]

22 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data
    • Reward function: r = 1 − p_K(s′), i.e., one minus the outlier score p_K

[Figure: the outlier score p_K mapped to the reward 1 − p_K]

23 of 24

Methods

  • Novelty detection
    • k-Nearest Neighbors in the state space of the input data
    • Reward function: r = 1 − p_K(s′), i.e., one minus the outlier score p_K
    • Outlier (rare) state => high p_K => low reward
    • Inlier (common) state => low p_K => high reward

[Figure: rare (outlier) states receive low reward, common (inlier) states receive high reward]

24 of 24

Methods