Roadster:
Road Sustainable Twins in Emilia Romagna
Introduction
*"ROADSTER does not deal with the adoption of AI. It instead focuses on the design and production of new AI technologies in the research centers involved in the project."
Activities carried out so far:
Transformer-based models for Visual Recognition and Semantic Segmentation
Vision Transformer (ViT) achieves state-of-the-art results in classification
[1] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J. “An image is worth 16x16 words: Transformers for image recognition at scale”. ICLR 2021.
Advantages
Vision Transformer (ViT) [1]
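[Figure: ViT architecture. Image patches go through a Linear Projection; the Patch + Positional Embeddings, together with an extra CLS token (tokens 0-9), feed the Transformer Encoder; an FC head outputs the class scores (Dog, Cat, Bird, …).]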
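The tokenization step of the ViT figure can be sketched as follows. This is a minimal NumPy illustration, not the reference implementation: the function name `vit_tokens`, the image size, the patch size, and the embedding dimension are assumptions, and random weights stand in for learned parameters.

```python
import numpy as np

def vit_tokens(image, patch, dim, rng):
    """Turn an (H, W, C) image into ViT input tokens of shape (1 + N, dim)."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Split the image into non-overlapping patches and flatten each one.
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    # Linear projection (random weights stand in for learned ones).
    w_proj = rng.standard_normal((patch * patch * c, dim)) * 0.02
    tokens = patches @ w_proj
    # Prepend the extra CLS token, then add positional embeddings.
    cls = rng.standard_normal((1, dim)) * 0.02
    tokens = np.concatenate([cls, tokens], axis=0)
    pos = rng.standard_normal((tokens.shape[0], dim)) * 0.02
    return tokens + pos

rng = np.random.default_rng(0)
tokens = vit_tokens(rng.random((48, 48, 3)), patch=16, dim=64, rng=rng)
print(tokens.shape)  # (10, 64): 9 patch tokens plus the CLS token
```

With a 48x48 input and 16x16 patches this gives ten tokens, matching the indices 0-9 in the figure. Every token attends to every other token in each encoder layer, which is where the computational cost comes from.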
Disadvantages: high computational cost and lack of hierarchical features
VT2D: a Hierarchical ViT with Bidimensional Max Pooling
Our Proposal:
VT2D: a Hierarchical ViT with 2D Max Pooling
[Figure: VT2D architecture. Patches go through a Linear Projection and receive positional embeddings (+ Pos), without a CLS token, giving tokens 0-15. A Transformer Layer is followed by a 2D Max Pooling Layer that reduces them to tokens 0-3; further Transformer Layers (marked xM in the figure), Average Pooling, and an FC head produce the class scores (Dog, Cat, Bird, …).]
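The 2D max pooling step over the token sequence can be illustrated with a short NumPy sketch. It assumes non-overlapping k x k windows with k = 2 and a row-major token layout; the function name `max_pool_tokens_2d` is hypothetical, and the actual kernel size and stride used in VT2D may differ.

```python
import numpy as np

def max_pool_tokens_2d(tokens, grid_h, grid_w, k=2):
    """Max-pool a (grid_h * grid_w, D) token sequence over k x k windows."""
    n, d = tokens.shape
    assert n == grid_h * grid_w and grid_h % k == 0 and grid_w % k == 0
    # Restore the 2D layout, splitting each spatial axis into (blocks, k).
    grid = tokens.reshape(grid_h // k, k, grid_w // k, k, d)
    pooled = grid.max(axis=(1, 3))  # max over each k x k window, per channel
    return pooled.reshape(-1, d)

tokens = np.arange(16, dtype=float).reshape(16, 1)  # token i carries the value i
pooled = max_pool_tokens_2d(tokens, 4, 4)
print(pooled.ravel().tolist())  # [5.0, 7.0, 13.0, 15.0]
```

Sixteen tokens on a 4x4 grid become four tokens, as in the figure. Pooling over the 2D grid keeps spatially neighbouring patches together, which pooling over the flattened 1D sequence would not.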
Investigating 2D Max Pooling at different stages
[Figure: four-stage VT2D. After the Linear Projection and positional embedding (+ Pos), without a CLS token, each of Stage 1 to Stage 4 applies 2D Max Pooling followed by three Transformer Layers, and the token sequence shrinks at every stage (indices in the figure go from 0 … 31 down to 0, 1). Average Pooling and an FC head produce the class scores (Dog, Cat, Bird, …).]
VT2D-4: 2D Pooling layer in all 4 stages
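To see how the number of tokens evolves when pooling is placed at different stages, the following sketch computes the token count entering the Transformer layers of each stage. The initial grid side (16), the 2x2 non-overlapping pooling, and the function name `token_schedule` are assumptions chosen for illustration; they are not taken from the slides.

```python
def token_schedule(grid_side, pooled_stages, num_stages=4, k=2):
    """Return the number of tokens entering the Transformer layers of each stage."""
    counts = []
    for stage in range(1, num_stages + 1):
        if stage in pooled_stages:
            grid_side //= k  # a k x k pooling divides each grid side by k
        counts.append(grid_side * grid_side)
    return counts

print(token_schedule(16, {1, 2, 3, 4}))  # [64, 16, 4, 1]  pooling in all 4 stages
print(token_schedule(16, set()))         # [256, 256, 256, 256]  no pooling
```

Under these assumptions, pooling in every stage shrinks the sequence rapidly, so later stages operate on far fewer tokens than a model without pooling.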
VT2D-Small: Experimental Results on CIFAR-100
Pooling with small kernel size
Pooling with large kernel size
Performance Comparison: Accuracy vs. FLOPs
[Figure: accuracy vs. FLOPs for VT (no pooling), VT1D, VT2D-1, VT2D-2, and VT2D-4, at 5.6 M and 21.7 M parameters.]
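The link between token count and FLOPs can be made concrete with a commonly used back-of-the-envelope estimate of the multiply-accumulate operations in one Transformer layer. This is a rough sketch, not the counting method used for the plot: it ignores biases, normalization, and softmax, and the embedding dimension (192), MLP ratio (4), and token counts are hypothetical.

```python
def layer_macs(n_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates of one Transformer layer."""
    projections = 4 * n_tokens * dim * dim      # Q, K, V and output projections
    attention = 2 * n_tokens * n_tokens * dim   # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim  # two linear layers of the MLP
    return projections + attention + mlp

full = layer_macs(256, 192)   # layer applied to 256 tokens
pooled = layer_macs(64, 192)  # same layer after a 2x2 pooling (64 tokens)
print(full, pooled, round(full / pooled, 2))
```

Because the attention term grows quadratically with the number of tokens, reducing the tokens by a factor of four lowers the per-layer cost by more than a factor of four in this estimate.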
Experimental Results on ImageNet
Results confirmed on ImageNet:
[Figure legend: VT (no pooling), VT1D, VT2D-1, VT2D-2, VT2D-4, at 5.6 M and 21.7 M parameters.]
Next steps
Transfer to Semantic Segmentation
Insertion of learnable "memories" of past training items
Semantic Segmentation
Applications:
Limitations:
[Figure: segmentation example with class labels Sky, Car, Road, Person, Vegetation, …]
Transformer-based approaches for Semantic Segmentation
Moving Towards an Open World
[1] Ghiasi G, Gu X, Cui Y, Lin TY. Open-Vocabulary Image Segmentation. arXiv preprint arXiv:2112.12143. 2021.
What the DAAM: Interpreting Stable Diffusion Using Cross Attention
Tang et al. [1] introduce DAAM, a pipeline that exploits the cross-attention between word tokens and image features in the Stable Diffusion denoising subnetwork to extract attention maps over the generated image.
[1] Tang, Raphael, et al. "What the DAAM: Interpreting Stable Diffusion Using Cross Attention." arXiv 2022.
These attention maps can be used to inject localization information for a large vocabulary in an unsupervised setting.
Input prompt: A woman walking past a fire hydrant.
FOSSIL: Reference Collection Generation through Stable Diffusion and DAAM
1. Generate an image corresponding to a caption from COCO
2. Parse the nouns from the caption and extract their textual embeddings
3. Use DAAM to extract the heatmap corresponding to the noun textual token
4. Binarize the heatmap to obtain a region proposal
5. Embed the image with DINO
6. Mask and pool the DINO features according to the region proposal
7. Store the pooled visual embedding and the noun textual embeddings in the Reference Collection
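The binarize, mask-and-pool, and store steps of the list above can be sketched in NumPy. This is only an illustration under stated assumptions: random arrays stand in for the DAAM heatmap, the DINO features, and the textual embedding; the 0.5 threshold, the use of average pooling, the array sizes, and the function name `region_embedding` are all hypothetical.

```python
import numpy as np

def region_embedding(heatmap, features, threshold=0.5):
    """Binarize a heatmap and average the features that fall inside the mask."""
    # Normalize to [0, 1] so the threshold is relative to the heatmap's range.
    span = heatmap.max() - heatmap.min()
    norm = (heatmap - heatmap.min()) / (span + 1e-8)
    mask = norm >= threshold              # binarized heatmap = region proposal
    pooled = features[mask].mean(axis=0)  # masked average pooling (assumed)
    return mask, pooled

rng = np.random.default_rng(0)
heatmap = rng.random((14, 14))                 # stand-in for a DAAM heatmap
features = rng.standard_normal((14, 14, 384))  # stand-in for DINO patch features
text_embedding = rng.standard_normal(512)      # stand-in for the noun embedding

mask, visual_embedding = region_embedding(heatmap, features)
# Store the (visual, textual) pair in the Reference Collection.
reference_collection = [(visual_embedding, text_embedding)]
print(int(mask.sum()), visual_embedding.shape)
```

In a real pipeline the heatmap would first be resized to the spatial resolution of the feature grid, so that mask and features align patch by patch.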
FOSSIL: Inference for Open-Vocabulary Segmentation
Comparison with state-of-the-art
FOSSIL achieves a new state of the art on 4 segmentation datasets!
Examples of generated images, DAAM heatmaps, and OpenCut masks
A woman on a bicycle is checking her phone: woman, bicycle, phone
A man and a dog sitting by a lake: man, dog, lake
A soda and a hotdog are on the table: soda, hotdog, table
Impact of OpenCut: qualitative examples
[Figure columns: Original Image, Ground Truth, FOSSIL w/o OpenCut, FOSSIL with OpenCut]
Event Classification and Anomaly Detection
Event Classification
The goal is to build an advanced system that recognizes situations that are dangerous for the driver, in order to:
Image by macrovector on Freepik
Real-time Proactive Online System
This means:
[1] Fatemidokht, Hamideh, et al. "Efficient and secure routing protocol based on artificial intelligence algorithms with UAV-assisted for vehicular ad hoc networks in intelligent transportation systems." IEEE Transactions on Intelligent Transportation Systems 22.7 (2021): 4757-4769.
Real-time: its response times must be short and predictable.
Proactive: it does not wait for an accident to involve the vehicle itself, but reacts beforehand.
Online: the future is unknown; only the present and, to a limited extent, the past are available.
Detection of Traffic Anomaly (DoTA) Dataset
Information available during model training [1]:
[1] Yao, Yu, et al. "DoTA: Unsupervised Detection of Traffic Anomaly in Driving Videos." IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
Novel Architecture
Four main blocks
Real-time: the model's computation is kept as low as possible, and the required execution time is fixed.
Proactive: in future applications, by analyzing how dangerous the situation outside the car is, it could give valuable information to the driver and to those nearby.
Online: it processes three frames at a time, together with a small latent state summarizing the past.
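The online processing pattern described above (three frames at a time plus a small latent state) can be sketched as a streaming loop. This is a toy illustration, not the project's architecture: the overlapping sliding window, the exponential-average state update, the "danger score", and the names `step` and `run_online` are all assumptions.

```python
from collections import deque
import numpy as np

def step(window, state, decay=0.9):
    """Hypothetical update: fold a three-frame window into a fixed-size state."""
    summary = np.stack(window).mean(axis=(0, 1, 2))  # one value per channel
    state = decay * state + (1.0 - decay) * summary
    score = float(np.abs(summary - state).mean())    # toy score, not a real model
    return state, score

def run_online(frames, channels=3):
    window = deque(maxlen=3)    # only the three most recent frames are kept
    state = np.zeros(channels)  # small latent summary of the past
    scores = []
    for frame in frames:        # frames arrive one at a time; the future is never read
        window.append(frame)
        if len(window) == 3:
            state, score = step(list(window), state)
            scores.append(score)
    return scores

rng = np.random.default_rng(0)
video = [rng.random((8, 8, 3)) for _ in range(10)]
print(len(run_online(video)))  # 8
```

Each step touches a fixed amount of data (three frames and the state), so the per-step cost does not grow with the length of the video, which is consistent with the fixed execution time mentioned under Real-time.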
Roadster Demo Interface
Thank you!
Questions?