1 of 10

Categorical Traffic Transformer

Yuxiao Chen, Sander Tonkens, and Marco Pavone

NVIDIA Research & UC San Diego


2 of 10

The role of prediction in the AV pipeline


Perception → Prediction → Planning

What constitutes a good prediction model for the downstream planner?

  • Scene-centric predictions
  • Multi-modal output (with probabilities)

3 of 10

CTT has a structured and interpretable latent


[Diagram: raw features → encoder (CNN, GNN, Transformer, …) → latent → decoder → trajectory predictions]

  • Gaussian latent spaces can suffer from:
    • Mode collapse
    • No control over the sampling process (e.g. lack of semantics)
    • Supervision only on the decoded trajectory

  • Instead, CTT’s categorical latent enables:
    • Strong mode diversity
    • Clear semantic interpretation, allowing a surjective mapping
    • Additional supervision signals (decoupling encoder and decoder training)

[Diagram: a Gaussian latent is supervised only via the decoded future GT trajectory, whereas CTT’s interpretable categorical latent is additionally supervised directly by the GT latent mode]

4 of 10

CTT leverages the underlying structure of driving to characterize its categorical latent


Homotopies are used to classify agent2agent interactions

  • 3 modes: S, CW, and CCW motion, based on relative angles

Lane segments describe the location of an agent in the scene

  • M lane segments
  • Agent2lane mode based on relative angle and relative position to the lane segment’s center point
    • Provides a continuous distance metric for losses
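As a rough sketch of the agent2agent homotopy classification, the mode can be read off the accumulated change of the relative-position angle between two trajectories. The function name and the winding threshold below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def homotopy_mode(traj_a, traj_b, threshold=0.3):
    """Classify the interaction between two (T, 2) trajectories into
    S / CW / CCW from the winding of the relative-position angle.
    `threshold` is an illustrative assumption, not a value from the paper."""
    rel = np.asarray(traj_b, dtype=float) - np.asarray(traj_a, dtype=float)
    ang = np.arctan2(rel[:, 1], rel[:, 0])       # relative angle at each step
    dang = np.diff(ang)
    dang = (dang + np.pi) % (2 * np.pi) - np.pi  # wrap increments to [-pi, pi)
    winding = dang.sum()                         # total signed angle swept
    if winding > threshold:
        return "CCW"
    if winding < -threshold:
        return "CW"
    return "S"
```

Because the winding angle is continuous, it also provides a natural distance signal for losses on the discrete mode.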

5 of 10

CTT architecture & supervision signals


  • The encoder predicts the marginal distributions of the homotopy and lane modes, as well as the joint scene mode
    • The GT scene mode is identified from the GT future trajectories

  • The decoder predicts scene-centric future trajectories conditioned on scene mode samples: a reconstruction loss is applied to predictions under the GT scene mode, and a consistency loss to non-GT modes
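A minimal sketch of this supervision structure (pure NumPy; the function names and the weighting `lam` are assumptions, and the consistency loss on non-GT modes is omitted for brevity):

```python
import numpy as np

def cross_entropy(logits, target):
    """CE of the categorical distribution given by `logits` w.r.t. a target index."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def ctt_loss(scene_mode_logits, gt_mode, traj_pred_gt_mode, gt_traj, lam=1.0):
    """Direct supervision on the categorical latent (the GT scene mode is
    identified from GT future trajectories) plus a reconstruction loss on
    the trajectory decoded under the GT mode. `lam` is illustrative."""
    latent_loss = cross_entropy(scene_mode_logits, gt_mode)   # supervises the encoder
    recon_loss = np.mean((traj_pred_gt_mode - gt_traj) ** 2)  # supervises the decoder
    return latent_loss + lam * recon_loss
```

The direct latent term is what lets the encoder be trained even when the decoder is poor, decoupling the two.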

 

 

6 of 10

Importance sampling is key for scalability
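The joint scene mode space grows exponentially with the number of agents and lanes, so only a handful of modes can be decoded. As an illustrative sketch (not the paper's exact procedure), candidate joint modes can be sampled from the product of the predicted marginals and the top-k distinct samples kept:

```python
import numpy as np

def sample_scene_modes(marginals, k=4, n_draws=256, seed=0):
    """marginals: list of 1-D categorical distributions (e.g. per-edge
    homotopy and per-agent lane-mode marginals). Draw joint scene modes
    from the product-of-marginals proposal and keep the k distinct most
    probable samples. An illustrative sketch, not the paper's algorithm."""
    rng = np.random.default_rng(seed)
    seen = {}
    for _ in range(n_draws):
        mode = tuple(int(rng.choice(len(p), p=p)) for p in marginals)
        seen[mode] = float(np.prod([p[m] for p, m in zip(marginals, mode)]))
    ranked = sorted(seen.items(), key=lambda kv: -kv[1])
    return [mode for mode, _ in ranked[:k]]
```

This keeps decoding cost linear in k rather than exponential in the number of agents.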


 

7 of 10

Results sneak peek (check out our poster!)


Prediction accuracy metrics:

Waymo:

Alg.    ML ADE   minADE   ML FDE   minFDE
ST      –        1.72     –        3.98
MTR     –        0.92     –        2.06
JFP     –        0.87     –        1.96
AF      3.24     2.36     7.86     5.10
CTT     0.97     0.80     2.67     2.08

nuScenes:

Alg.    ML ADE   minADE   ML FDE   minFDE
ILVM    –        0.86     –        1.84
ScePT   –        0.86     –        1.84
AF      1.23     0.80     2.63     1.60
CTT     0.55     0.43     1.37     0.96

[Figure: qualitative results; each sample represents a distinct scene mode]

Categorical mode accuracy:

nuScenes            AF     CTT
a2l correct rate    74%    86%
a2l cover rate      86%    90%
a2a correct rate    82%    89%
a2a cover rate      92%    95%
SM correct rate     60%    76%
SM cover rate       78%    82%

WOMD                AF     CTT
a2l correct rate    78%    87%
a2l cover rate      85%    93%
a2a correct rate    42%    79%
a2a cover rate      44%    90%
SM correct rate     36%    70%
SM cover rate       39%    76%

8 of 10

Come check out our poster for more results!

  • The interpretable latent provides several benefits:
    • Direct supervision on the latent variable
    • No mode collapse
    • Easy integration with other modules via semantic signals (e.g. LLMs)

  • In practice, behavior-level modes are far more important and useful than trajectory-level modes

More on my research

NVIDIA AVG group

Yuxiao Chen

Sander Tonkens

Marco Pavone

9 of 10

BACKUP SLIDES


10 of 10

CTT’s network architecture combines a Transformer with a GNN


  • Factorized attention across multiple axes and multiple variables (agent history, agent future, lane segments, etc.)
  • A GNN to explicitly embed edges and predict lane modes
  • Scene-centric, yet equivariant!
  • The decoder autoregressively generates actions, then passes them through dynamics models
  • Fully vectorized, supporting closed-loop training!
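A minimal NumPy sketch of factorized attention over the time and agent axes (single head, no projections or masking; it illustrates the axis factorization only, not the paper's exact layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Single-head self-attention over the second-to-last axis of x (..., L, d)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_attention(x):
    """x: (A agents, T timesteps, d features). Instead of attending over all
    A*T tokens jointly (cost O((A*T)^2)), attend along one axis at a time."""
    x = x + attend(x)           # temporal attention: tokens attend within each agent
    x = np.swapaxes(x, 0, 1)    # (T, A, d)
    x = x + attend(x)           # social attention: tokens attend within each timestep
    return np.swapaxes(x, 0, 1) # back to (A, T, d)
```

Stacking such layers mixes information across both axes while keeping each attention map small.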