Categorical Traffic Transformer
Yuxiao Chen, Sander Tonkens, and Marco Pavone
NVIDIA Research & UC San Diego
The role of prediction in the AV pipeline
Perception
Prediction
Planning
What constitutes a good prediction model for a downstream planner?
Scene-centric predictions
Multi-modal output (with probabilities)
CTT has a structured and interpretable latent
Encoder
Decoder
Latent
Raw features
Trajectory predictions
CNN, GNN, Transformer, …
Gaussian
Categorical
Supervision
Future GT trajectory
GT latent mode
Supervision
Interpretable latent
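A minimal sketch of how a categorical latent could be supervised with a GT latent mode, assuming the encoder emits one logit per mode and the GT mode is an integer index. The function names and the three-mode example are illustrative assumptions, not taken from the CTT implementation.

```python
import math

def softmax(logits):
    """Turn raw encoder logits into a categorical distribution over modes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mode_cross_entropy(logits, gt_mode):
    """Cross-entropy between the predicted mode distribution and the GT mode index."""
    return -math.log(softmax(logits)[gt_mode])

# Hypothetical logits for three latent modes; mode 0 is taken as ground truth.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
loss = mode_cross_entropy(logits, gt_mode=0)
```

Because each latent index corresponds to a named mode, the resulting probabilities can be read directly, which is not the case for a Gaussian latent.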
CTT leverages the underlying structure of driving to define its categorical latent
Homotopy classes are used to classify agent-to-agent (a2a) interactions
M lane segments describe the location of each agent in the scene (agent-to-lane, a2l)
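One common way to assign a homotopy class to a pair of trajectories is to accumulate the change in bearing of the vector from one agent to the other. The sketch below follows that idea; the class names (`CW`, `CCW`, `STATIC`) and the threshold value are assumptions for illustration, not values stated on the slide.

```python
import math

def winding_angle(traj_a, traj_b):
    """Accumulated change in bearing of the vector from agent A to agent B."""
    total = 0.0
    prev = None
    for (ax, ay), (bx, by) in zip(traj_a, traj_b):
        bearing = math.atan2(by - ay, bx - ax)
        if prev is not None:
            delta = bearing - prev
            # Wrap the step into [-pi, pi) so crossing the +/-pi cut is not a jump.
            delta = (delta + math.pi) % (2 * math.pi) - math.pi
            total += delta
        prev = bearing
    return total

def homotopy_class(traj_a, traj_b, threshold=math.pi / 6):
    """Classify the pairwise interaction; `threshold` is a hypothetical tuning value."""
    w = winding_angle(traj_a, traj_b)
    if w > threshold:
        return "CCW"
    if w < -threshold:
        return "CW"
    return "STATIC"

# Agent A stays put while agent B passes in front of it.
a = [(0.0, 0.0)] * 3
b = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
label = homotopy_class(a, b)
```

Two agents driving in parallel keep a constant relative bearing and fall into `STATIC`, whereas one agent passing the other produces a clear rotation.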
CTT architecture & supervision signals
Importance sampling is key for scalability
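The slide gives only the headline, so the following is a generic sketch of self-normalized importance sampling over joint scene modes rather than CTT's actual procedure. It assumes a joint mode is a tuple of per-factor mode indices, that the proposal is the product of per-factor marginals, and that `joint_score` (a hypothetical callable) returns a positive unnormalized score for a joint mode.

```python
import random

def sample_scene_modes(marginals, joint_score, num_samples, seed=0):
    """Draw joint modes from the product of marginals and reweight by the joint score."""
    rng = random.Random(seed)
    samples, weights = [], []
    for _ in range(num_samples):
        # Sample each factor independently from its marginal (the proposal q).
        mode = tuple(rng.choices(range(len(p)), weights=p)[0] for p in marginals)
        q = 1.0
        for p, k in zip(marginals, mode):
            q *= p[k]
        samples.append(mode)
        # Importance weight: unnormalized target score divided by proposal probability.
        weights.append(joint_score(mode) / q)
    total = sum(weights)
    return samples, [w / total for w in weights]

def score(mode):
    """Hypothetical joint score that favors agreement between the first two factors."""
    return 2.0 if mode[0] == mode[1] else 1.0

marginals = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
samples, weights = sample_scene_modes(marginals, score, num_samples=50)
```

The point of the design is that the number of joint modes grows exponentially with the number of factors, while the cost here grows only with `num_samples`.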
Results sneak peek (check out our poster!)
Waymo
Alg. | ML ADE | min ADE | ML FDE | min FDE |
ST | – | 1.72 | – | 3.98 |
MTR | – | 0.92 | – | 2.06 |
JFP | – | 0.87 | – | 1.96 |
AF | 3.24 | 2.36 | 7.86 | 5.10 |
CTT | 0.97 | 0.80 | 2.67 | 2.08 |
nuScenes
Alg. | ML ADE | min ADE | ML FDE | min FDE |
ILVM | – | 0.86 | – | 1.84 |
ScePT | – | 0.86 | – | 1.84 |
AF | 1.23 | 0.8 | 2.63 | 1.6 |
CTT | 0.55 | 0.43 | 1.37 | 0.96 |
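For reference, the four columns in the tables above can be computed along these lines. The sketch assumes that "ML" denotes the most likely sample, that each trajectory is a list of (x, y) waypoints for a single agent, and that `samples` and `probs` are aligned lists; these are assumptions, since the slide does not define the metrics.

```python
import math

def ade(pred, gt):
    """Average displacement error: mean Euclidean distance over all waypoints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final displacement error: Euclidean distance at the last waypoint."""
    return math.dist(pred[-1], gt[-1])

def prediction_metrics(samples, probs, gt):
    """ML metrics use the highest-probability sample; min metrics use the best sample."""
    ml = max(range(len(samples)), key=lambda i: probs[i])
    return {
        "ML ADE": ade(samples[ml], gt),
        "min ADE": min(ade(s, gt) for s in samples),
        "ML FDE": fde(samples[ml], gt),
        "min FDE": min(fde(s, gt) for s in samples),
    }

# Toy example: two predicted samples against one ground-truth trajectory.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],  # most likely, but drifts away
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],  # less likely, closer to GT
]
probs = [0.6, 0.4]
metrics = prediction_metrics(samples, probs, gt)
```

The gap between an ML column and its min counterpart shows how often the best sample is not the one the model ranks highest.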
Each sample represents a distinct scene mode
Prediction accuracy metrics
Qualitative results
Categorical mode accuracy
nuScenes | AF | CTT |
a2l correct rate | 74% | 86% |
a2l cover rate | 86% | 90% |
a2a correct rate | 82% | 89% |
a2a cover rate | 92% | 95% |
SM correct rate | 60% | 76% |
SM cover rate | 78% | 82% |
WOMD | AF | CTT |
a2l correct rate | 78% | 87% |
a2l cover rate | 85% | 93% |
a2a correct rate | 42% | 79% |
a2a cover rate | 44% | 90% |
SM correct rate | 36% | 70% |
SM cover rate | 39% | 76% |
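The two rates in these tables could be computed as sketched below, assuming "correct rate" means the most likely predicted mode matches the ground truth and "cover rate" means the ground-truth mode appears among any of the predicted modes. Both definitions are assumptions here, as the slide does not spell them out, and the mode labels are placeholders.

```python
def mode_rates(predictions, gt_modes):
    """Return (correct rate, cover rate) over a set of scenes.

    predictions: per-scene list of predicted modes, most likely first.
    gt_modes: per-scene ground-truth mode.
    """
    correct = 0
    covered = 0
    for modes, gt in zip(predictions, gt_modes):
        correct += modes[0] == gt  # top-ranked mode is right
        covered += gt in modes     # GT appears somewhere in the samples
    n = len(gt_modes)
    return correct / n, covered / n

# Four hypothetical scenes with two predicted modes each.
predictions = [["CW", "CCW"], ["S", "CW"], ["CCW", "S"], ["CW", "S"]]
gt_modes = ["CW", "CW", "CW", "S"]
correct_rate, cover_rate = mode_rates(predictions, gt_modes)
```

Under these definitions the cover rate can never be lower than the correct rate, which is consistent with every row of the tables above.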
Come check out our poster for more results!
More on my research
NVIDIA AVG group
Yuxiao Chen
Sander Tonkens
Marco Pavone
BACKUP SLIDES
CTT's network architecture combines a transformer with a GNN