1 of 60

Forecasting &

Trajectory Prediction

Presenters:

Charan N. & Matthew H.

2 of 60

Outline

  • Motivation
  • LSTM / RNNs
  • Trajectron++
  • Multimodal Trajectory Predictions
  • Multi-Agent Tensor Fusion
  • MultiPath
  • ChauffeurNet - Quick 3-min summary
  • CoverNet - Quick 3-min summary

3 of 60

Motivation

  • How do humans drive?
    • See and understand various objects
    • Anticipate other agents’ actions/trajectories
    • Plan trajectory and how to control the car
  • Some AD Topics this semester
    • Object Detection
    • Depth Estimation
    • Point Clouds

4 of 60

Framework

  • Input
    • Vehicles, Pedestrians, “Agents”
    • Ego-Agent
    • Obstacles
    • Maps (w/ varying levels of detail)
    • Agent history is stored in LSTMs
  • Predict trajectories of other agents
  • Plan trajectory of ego-agent
  • Output
    • Behaviors
    • Control Commands

5 of 60

Long Short Term Memory (LSTM) Network

  • A recurrent neural network (RNN) can be thought of as multiple copies of the same network, each passing a message to a successor.
  • RNNs struggle with long-term dependencies (gradients vanish over many time steps)
  • LSTM is a special kind of RNN designed to avoid this problem

Recurrent Neural Network

LSTM

6 of 60

Forget gate layer

Input gate layer
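To make the forget/input/output gates above concrete, here is a minimal scalar LSTM step in plain Python. The weight layout (one input weight, one recurrent weight, and one bias per gate) is a simplifying assumption for illustration; real implementations use weight matrices over vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar state.

    w maps each gate name to (w_x, w_h, b): input weight, recurrent
    weight, and bias.
    """
    def gate(name, act):
        w_x, w_h, b = w[name]
        return act(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)   # forget gate: how much old cell state to keep
    i = gate("input", sigmoid)    # input gate: how much new candidate to write
    g = gate("cand", math.tanh)   # candidate cell update
    o = gate("output", sigmoid)   # output gate: how much cell state to expose
    c = f * c_prev + i * g        # new cell state (the "long-term" memory)
    h = o * math.tanh(c)          # new hidden state (the message passed on)
    return h, c
```

The additive cell-state update `f * c_prev + i * g` is what lets gradients flow across many time steps, addressing the long-term dependency problem.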

7 of 60

8 of 60

Gated Recurrent Unit (GRU)

Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
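The GRU simplifies the LSTM by merging the forget and input gates into a single update gate and dropping the separate cell state. A scalar sketch, again under the illustrative assumption of one input weight, one recurrent weight, and one bias per gate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell_step(x, h_prev, w):
    """One GRU step with scalar state; w maps gate name -> (w_x, w_h, b)."""
    def lin(name, h):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h + b

    z = sigmoid(lin("update", h_prev))            # update gate: blend old vs. new
    r = sigmoid(lin("reset", h_prev))             # reset gate: how much history feeds the candidate
    h_tilde = math.tanh(lin("cand", r * h_prev))  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_tilde       # new hidden state
```

With fewer gates and no cell state, GRUs are cheaper than LSTMs, which is one reason decoders such as the one in Trajectron++ use them.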

9 of 60

Outline

  • Motivation
  • LSTM
  • Trajectron++
  • Multimodal Trajectory Predictions
  • Multi-Agent Tensor Fusion
  • MultiPath
  • ChauffeurNet
  • CoverNet

10 of 60

Paper 1:

Trajectron++: Dynamically-Feasible Trajectory Forecasting With Heterogeneous Data

Published in 2021

Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone

11 of 60

Trajectron++ - Introduction

  • Predicting future behavior is important for autonomous driving systems
  • Current trajectory prediction methods:
  • Deterministic regressors
    • Produce a single future trajectory
    • Treat prediction as a time-series regression problem
    • Models: Gaussian process regression (GPR), Kalman filters, inverse reinforcement learning, RNNs
  • Generative, probabilistic models
    • Produce a distribution over potential future trajectories
    • Preferable to deterministic regressors because of this multi-modality
    • Utilize GANs or CVAEs + GMMs
    • Models: MATF, Trajectron

12 of 60

Trajectron++ - Introduction

  • These methods ignore important real-world requirements:
    • Agents' dynamic constraints
    • The ego-agent's planned motion
    • An easily extensible framework for heterogeneous / environmental information

  • Goals: address all three of the above

13 of 60

14 of 60

Trajectron++ - Architecture

  • Spatiotemporal Graph
  • Nodes
    • Agents
    • Individual semantic class
  • Edges
    • Interaction between agents
    • Directed edges - perception
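As a concrete sketch of the node/edge structure, the snippet below builds directed perception edges from agent positions. The per-class perception radius and the distance rule are illustrative assumptions, not the paper's exact edge definition.

```python
import math

def build_interaction_graph(agents, radius):
    """agents: name -> (x, y, semantic_class); radius: class -> perception range.

    Returns directed edges (a, b) meaning "a perceives b": a vehicle may
    perceive a distant pedestrian while the pedestrian does not react back.
    """
    edges = []
    for a, (xa, ya, ca) in agents.items():
        for b, (xb, yb, _) in agents.items():
            if a != b and math.hypot(xa - xb, ya - yb) <= radius[ca]:
                edges.append((a, b))
    return edges
```

Because edges are directed, influence need not be symmetric, which matches the slide's "directed edges - perception" point.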

15 of 60

Trajectron++ - Architecture

  • Agents - “Nodes”
    • Current State
    • History - stored in LSTMs
  • Agent Interactions - “Edges”
    • Encoded using attention
    • Utilizes agents’ histories
  • Heterogeneous Data
    • Map & Localization Data - CNN & FC Map
    • Other data sources: LIDAR, Camera Images, Gaze Direction
  • Encode future ego-agent motion plans
    • Trajectory prediction
    • Learn based on ground truth output

16 of 60

Trajectron++ - Architecture

  • Scene representation with agent and ego-agent interactions
  • Multi-modality
    • CVAE - encodes behaviors as a probability distribution
    • z - categorical high-level latent variable capturing behavior modes
  • Decoder
    • Receives all encoded input: multiple agents, dynamics constraints, heterogeneous data
    • GRU - decodes the latent state
    • GMM - produces a distribution over controls (steering, acceleration/braking)
    • Dynamics integration turns controls into positions
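The dynamics-integration idea (decode controls rather than positions, then integrate them through a dynamics model) can be sketched with the simplest case, a single integrator that Euler-integrates velocity controls; Trajectron++ also supports richer models such as the unicycle for vehicles.

```python
def integrate_single_integrator(x0, y0, controls, dt):
    """Euler-integrate (vx, vy) control samples into a position trajectory.

    Because positions come from integrating controls, every output
    trajectory is dynamically feasible by construction.
    """
    traj = []
    x, y = x0, y0
    for vx, vy in controls:
        x += vx * dt
        y += vy * dt
        traj.append((x, y))
    return traj
```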

17 of 60

Trajectron++ - Output Configurations

  • Most Likely

  • Z mode

  • Full

  • Distribution

18 of 60

Trajectron++ - Training and Evaluation

  • Training

  • Evaluation Metrics
    • Average Displacement Error (ADE)
    • Final Displacement Error (FDE)
    • Kernel Density Estimate-based Negative Log Likelihood (KDE NLL)
    • Best of N (BoN)
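The two displacement metrics are simple to compute; a minimal version:

```python
import math

def ade_fde(pred, gt):
    """Average and final displacement error between two (x, y) trajectories."""
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]   # (ADE, FDE)
```

Best-of-N then reports the minimum ADE/FDE over N trajectories sampled from the predicted distribution.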

19 of 60

Trajectron++ - Results

  • Deterministic methods comparison
  • Probabilistic methods comparison

  • KDE NLL

20 of 60

Trajectron++ - Results

  • nuScenes Dataset

  • Ego-Vehicle

21 of 60

Trajectron++ - Summary

  • Probabilistic Trajectories
  • Dynamics
  • Heterogeneous Data
    • Map Data
    • Images
    • LIDAR

22 of 60

Outline

  • Motivation
  • LSTM
  • Trajectron++
  • Multimodal Trajectory Predictions
  • Multi-Agent Tensor Fusion
  • MultiPath
  • ChauffeurNet
  • CoverNet

23 of 60

Paper 2:

Multimodal Trajectory Predictions for Autonomous Driving using Deep Convolutional Networks

Published in 2019

Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider and Nemanja Djuric

24 of 60

Multimodal Trajectory Prediction (MTP)

  • Correctly predicting the movement of surrounding actors is a critical piece of the autonomous driving puzzle.
  • We also need to account for the actors' multimodal nature: several distinct futures can be plausible at once.
  • MTP goes beyond a single trajectory and instead outputs multiple trajectories together with their probabilities.
  • The method uses a rasterized vehicle context (including the high-definition map and other actors) as model input to predict the actor's movement in a dynamic environment.
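A sketch of how such a multimodal head's flat output could be decoded into M trajectories plus mode probabilities. The output layout here (trajectory values first, then mode logits) is an assumption for illustration, not the paper's exact tensor shape.

```python
import math

def decode_mtp_output(raw, num_modes, horizon):
    """Split a flat prediction head output into trajectories + probabilities.

    Assumed layout: num_modes * 2 * horizon (x, y) trajectory values,
    followed by num_modes probability logits.
    """
    trajs = []
    for m in range(num_modes):
        off = m * 2 * horizon
        pts = raw[off:off + 2 * horizon]
        trajs.append([(pts[2 * t], pts[2 * t + 1]) for t in range(horizon)])
    logits = raw[num_modes * 2 * horizon:]
    total = sum(math.exp(l) for l in logits)
    probs = [math.exp(l) / total for l in logits]   # softmax over modes
    return trajs, probs
```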

25 of 60

Engineered approaches

  • Kalman filter
    • Works well for short-term predictions
    • Performance degrades over longer horizons

These models fail to scale to many different traffic scenarios

Machine-Learning approaches

  1. LSTM-based models
  2. GRU-CVAE
  3. Mixture density networks (MDNs)
     • Solve multimodal regression tasks by learning the parameters of a Gaussian mixture model

26 of 60

MobileNet-v2

27 of 60

Mixture-of-Experts (ME) loss

  • Weights each mode's single-mode loss by that mode's predicted probability
  • Suffers from the mode collapse problem

Multiple-Trajectory Prediction (MTP) loss

  • Applies the single-mode loss only to the mode closest to the ground-truth trajectory
  • Adds a classification cross-entropy loss on the mode probabilities

Multimodal loss: the weighted sum of the regression and classification terms
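A minimal sketch of an MTP-style loss. Choosing the best mode by average displacement to the ground truth, and using a plain squared-error regression term with weight `alpha` on the cross-entropy, are illustrative simplifications of the paper's exact distance function and weighting.

```python
import math

def mtp_loss(trajs, probs, gt, alpha=1.0):
    """Best-mode regression loss plus cross-entropy on mode probabilities."""
    def avg_disp(traj):
        return sum(math.hypot(px - gx, py - gy)
                   for (px, py), (gx, gy) in zip(traj, gt)) / len(gt)

    # Pick the mode whose trajectory is closest to the ground truth.
    best = min(range(len(trajs)), key=lambda m: avg_disp(trajs[m]))
    # Regression loss is applied only to that mode (avoids mode collapse).
    reg = sum((px - gx) ** 2 + (py - gy) ** 2
              for (px, py), (gx, gy) in zip(trajs[best], gt)) / len(gt)
    # Cross-entropy pushes the matched mode's probability up.
    ce = -math.log(probs[best])
    return reg + alpha * ce
```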

28 of 60

29 of 60

30 of 60

Outline

  • Motivation
  • LSTM
  • Trajectron++
  • Multimodal Trajectory Predictions
  • Multi-Agent Tensor Fusion
  • MultiPath
  • ChauffeurNet
  • CoverNet

31 of 60

Paper 3:

Multi-Agent Tensor Fusion for Contextual Trajectory Prediction

Published in 2019

Tianyang Zhao, Yifei Xu, Mathew Monfort, Wongun Choi, Chris Baker, Yibiao Zhao, Yizhou Wang, Ying Nian Wu

32 of 60

Multi-Agent Tensor Fusion network (MATF)

  • Agents' motions are stochastic and depend on their goals, social interactions with other agents, and the scene context
  • Encoding this information is difficult for NN-based approaches because they prefer fixed input, output, and parameter dimensions, while these dimensions vary in the prediction task
  • This issue is addressed by two types of encoding:
    • Agent-centric: apply aggregation functions to multiple agents' feature vectors
    • Spatial-centric: operate directly on top-down representations of the scene
  • MATF combines the strengths of the agent- and spatial-centric approaches

33 of 60

MATF

Inputs:

  1. Past trajectories of the n dynamic, interacting agents in the scene
  2. Bird's-eye view scene context image c

Output:

  • Predicted future trajectories of all agents

34 of 60

Multi-Agent Tensor

- Encode the past trajectory of each agent independently using a single-agent LSTM encoder (all LSTM encoders share the same parameters, making the architecture invariant to the number of agents in the scene)

- Output: encoded state vectors, or "agent vectors"

- In parallel, encode the static scene context image c with a CNN

- Output: a scaled feature map retaining the spatial structure of the scene

35 of 60

MATF Encoding

- The agent encodings are placed into a spatial tensor according to their positions at the last time step of their past trajectories.

- This tensor is then concatenated with the encoded scene image in the channel dimension to get a combined tensor called the Multi-Agent Tensor.

- U-Net-like fully convolutional spatial fusion layers are applied to the Multi-Agent Tensor to output a fused Multi-Agent Tensor.
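The tensor-construction step can be sketched with plain nested lists. The grid indexing and the overwrite behavior for agents sharing a cell are simplifying assumptions for illustration.

```python
def build_multi_agent_tensor(scene, agent_vecs, agent_cells, agent_dim):
    """Place agent feature vectors into a spatial grid and concatenate
    channel-wise with the scene feature map.

    scene: H x W x C_s nested lists; agent_vecs: id -> feature list of
    length agent_dim; agent_cells: id -> (row, col) grid cell derived
    from the agent's last observed position.
    """
    H, W = len(scene), len(scene[0])
    c_s = len(scene[0][0])
    # Scene channels first, then zero-initialized agent channels.
    out = [[list(scene[r][c]) + [0.0] * agent_dim for c in range(W)]
           for r in range(H)]
    for aid, (r, c) in agent_cells.items():
        out[r][c][c_s:] = agent_vecs[aid]   # agents sharing a cell overwrite
    return out
```

Keeping agent features at their spatial locations is what lets the convolutional fusion layers reason about nearby agents and scene structure together.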

36 of 60

MATF Decoding

Fused vectors for each agent are sliced out according to their coordinates from the fused Multi-Agent Tensor output

These agent specific representations are then added as residual to original encoded agent vectors to form final agent encoding vectors

For each agent in the scene, its final vector is decoded to future trajectory prediction by LSTM decoders.
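A matching sketch of the slice-and-residual step; the `scene_channels` parameter (marking where the agent channels begin) and the names are illustrative.

```python
def final_agent_vectors(fused_tensor, agent_vecs, agent_cells, scene_channels):
    """Slice each agent's fused features at its cell and add them as a
    residual to the original encoded agent vector."""
    final = {}
    for aid, (r, c) in agent_cells.items():
        fused = fused_tensor[r][c][scene_channels:]   # agent channels only
        final[aid] = [o + f for o, f in zip(agent_vecs[aid], fused)]
    return final
```

The residual connection means the fusion layers only need to learn the interaction-dependent correction, not re-encode each agent's history from scratch.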

37 of 60

Step 1: Multi-agent past trajectories → Single-Agent LSTM Encoders → agent encodings
        Scene context image → CNN → encoded scene context image

Step 2: Encoded agent vectors and the encoded scene context are concatenated spatially to form the Multi-Agent Tensor

Step 3: Multi-Agent Tensor → U-Net fusion layers → fused Multi-Agent Tensor

38 of 60

Multi-Agent Tensor

MATF Decoding

39 of 60

Step 4: Fused vectors for each agent are sliced out of the fused Multi-Agent Tensor; each contains interaction, history, and constraint features for the corresponding agent

Step 5: Original encoded vectors + fused vectors → final agent encoding vectors (residual connection)

Step 6: Final agent encoding vectors → Single-Agent LSTM Decoder → future trajectory predictions

40 of 60

MATF GAN

- Used to learn a stochastic generative model

Generator -

  • Similar to MATF, but while decoding, the final agent vectors are concatenated with a Gaussian white-noise vector z

Discriminator -

  • Same as the MATF encoder, except its single-agent LSTM takes in both past and future trajectories as input instead of just the past
  • Final agent encodings are fed into fully connected layers to be classified as real or fake

41 of 60

Losses

Deterministic model

  1. Reconstruction loss

Stochastic generative model

  1. Adversarial loss
  2. Loss used to train MATF GAN

42 of 60

Results and Ablation Studies

43 of 60

44 of 60

Outline

  • Motivation
  • LSTM
  • Trajectron++
  • Multimodal Trajectory Predictions
  • ChauffeurNet
  • Multi-Agent Tensor Fusion
  • MultiPath
  • CoverNet

45 of 60

Paper 4:

ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

Published in 2018

Mayank Bansal, Alex Krizhevsky, Abhijit Ogale

46 of 60

ChauffeurNet - Introduction

  • How do people drive cars?
  • Goal: Get imitation learning to learn how to drive a car
    • 30 million driving examples => 60 days of driving
    • RNN - ChauffeurNet
    • Use mid-level input from perception systems (control model complexity)
    • Outputs a driving trajectory (given to control components)
  • Pure imitation learning is insufficient
    • Include losses that discourage bad behavior & encourage progress
    • Expose to non-expert behavior (ex: collisions, off-road driving, etc)

47 of 60

ChauffeurNet - Architecture

  • Inputs

  • Training the Model
    • ChauffeurNet RNN
    • Perception RNN
    • Training Losses w/Ground Truth data

48 of 60

ChauffeurNet - Training

  • Imitation Loss Functions
  • Past Motion Dropout

49 of 60

ChauffeurNet - Training

  • Imitation Learning is NOT Enough
  • Simulate Perturbations & Collisions
  • Loss
    • Collision loss
    • Off-road loss
    • Geometry loss
    • Objects loss
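One of these environment losses can be sketched crudely: an off-road penalty as the fraction of predicted waypoints landing off a binary road mask. ChauffeurNet actually renders trajectories as heatmaps and penalizes overlap with rendered road/object masks, so this discrete version is only illustrative.

```python
def offroad_loss(waypoints, road_mask):
    """Fraction of predicted waypoints on non-road cells.

    waypoints: list of (row, col) grid cells; road_mask: nested list
    with truthy entries on drivable cells.
    """
    off = sum(0 if road_mask[r][c] else 1 for r, c in waypoints)
    return off / len(waypoints)
```

Losses like this discourage bad behavior that pure imitation of expert data never demonstrates, which is why they are added on top of the imitation losses.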

50 of 60

ChauffeurNet - Results

51 of 60

Outline

  • Motivation
  • LSTM
  • Trajectron++
  • Multimodal Trajectory Predictions
  • Multi-Agent Tensor Fusion
  • ChauffeurNet
  • MultiPath

52 of 60

Paper 5:

MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction

Published in 2019

Yuning Chai, Benjamin Sapp, Mayank Bansal, Dragomir Anguelov

53 of 60

MultiPath model

Goal: diversity and coverage

  • Employs a fixed set of trajectory anchors as the basis for modeling
  • Trajectory anchors are modes found in the training data in state-sequence space via unsupervised learning
  • Intent uncertainty (what the agent intends to do) is encoded as a distribution over the set of anchor trajectories
  • Control uncertainty (how the intent is executed) is modeled as a normal distribution at each future time step, conditioned on the chosen anchor
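Putting intent and control uncertainty together, the mixture likelihood can be sketched with isotropic per-step Gaussians; the paper predicts full per-step covariances, so the shared scalar `sigma` here is a simplifying assumption.

```python
import math

def multipath_log_likelihood(traj, anchors, probs, offsets, sigma):
    """log p(s|x) for a MultiPath-style mixture: softmax weights over
    anchors, and a Gaussian at each step centered on anchor + offset."""
    def log_gauss(dx, dy):
        var = sigma * sigma
        return -(dx * dx + dy * dy) / (2 * var) - math.log(2 * math.pi * var)

    comps = []
    for k, anchor in enumerate(anchors):
        lp = math.log(probs[k])          # intent: anchor probability
        for (sx, sy), (ax, ay), (ox, oy) in zip(traj, anchor, offsets[k]):
            lp += log_gauss(sx - (ax + ox), sy - (ay + oy))  # control term
        comps.append(lp)
    m = max(comps)                        # log-sum-exp over anchors
    return m + math.log(sum(math.exp(c - m) for c in comps))
```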

54 of 60

55 of 60

ResNet-based architecture

Input: a 3-dimensional array of data rendered from a simple top-down orthographic perspective (x, y, t)

  • A 3D feature map of the entire top-down space
  • Extract 11 x 11 patches centered on each agent
  • 4 convolutional layers with kernel size 3 and 8/16 channels
  • Produces K x T x 5 parameters per agent

Desired output:

  1. A parametric distribution over future trajectories s: p(s|x)
  2. A compact weighted set of explicit trajectories that summarizes this distribution well

56 of 60

  • The anchor set A is obtained by running the k-means algorithm on the training trajectories, using a squared distance between trajectories, where M_u, M_v are affine transformation matrices that put trajectories into a canonical rotation- and translation-invariant agent-centric coordinate frame

  • Intent uncertainty → softmax distribution over the anchors
  • Control uncertainty → Gaussian distribution at each time step, whose mean is the anchor state plus a scene-specific offset, with the Gaussian parameters predicted by the model

57 of 60

  • Can predict all time steps jointly in a single inference pass, making the model simple to train and efficient to evaluate
  • The output is a Gaussian mixture model (GMM) at each time step, with mixture weights fixed over time

Negative log-likelihood loss

58 of 60

59 of 60

60 of 60