1 of 38

Roadster:

Road Sustainable Twins in Emilia Romagna

2 of 38

Introduction

  • Goal: ROADSTER aims to develop state-of-the-art Computer Vision and Deep Learning solutions for the ecosystem of roads and transportation facilities (possibly in industrial areas)

  • Long-term vision: to provide new data and services for:
    • Road conditions and anomalies
    • Personalized services on traffic and transport optimization
    • Improvement of worker safety during journeys

*“ROADSTER does not deal with the adoption of AI. It instead focuses on the design and production of new AI technologies in the research centers involved in the project.”

3 of 38

Introduction

Activities carried out so far:

  • Research on Transformer-based models for visual recognition (UNIMORE);
  • Research on (open-vocabulary) semantic segmentation for images taken from dash cameras, also through the adoption of superpixel priors and V&L techniques (UNIMORE);
  • Research on novel algorithms for the recognition of anomalies and dangerous situations, as well as event classification strategies (UNIPR);
  • Development of techniques enabling the execution of advanced AI algorithms on vehicle-mounted ultra-low-cost, low-footprint boards and cameras (UNIBO);
  • Development of the ROADSTER demo interface, employed to showcase the results of the project (UNIMORE).

*“ROADSTER does not deal with the adoption of AI. It instead focuses on the design and production of new AI technologies in the research centers involved in the project.”

4 of 38

Transformer-based models for Visual Recognition and Semantic Segmentation

5 of 38

Vision Transformer (ViT) achieves state-of-the-art in Classification

Advantages:

  • Infinite receptive field (content-based pairwise similarities)
  • Ability to learn long-range dependencies
  • Fewer sequential operations (network parallelization)
  • Scalable architecture
  • Better generalization (no inductive biases such as locality and translation equivariance)

[1] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J. “An image is worth 16x16 words: Transformers for image recognition at scale”. ICLR 2021.

[Figure: Vision Transformer (ViT) [1]. Image patches → Linear Projection → Patch + Positional Embedding with an extra CLS token → Transformer Encoder → FC head (Dog / Cat / Bird).]
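To make the pipeline concrete, the following is a minimal PyTorch sketch of a ViT-style classifier (toy sizes chosen by us, not the original implementation):

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patchify, linearly project, add a CLS
    token and positional embeddings, run a Transformer encoder, classify."""
    def __init__(self, img=32, patch=8, dim=64, depth=2, heads=4, classes=3):
        super().__init__()
        n = (img // patch) ** 2                      # number of patches
        self.proj = nn.Conv2d(3, dim, patch, patch)  # linear projection of patches
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # extra CLS token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # positional embedding
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)          # FC head, e.g. Dog / Cat / Bird

    def forward(self, x):                            # x: (B, 3, img, img)
        t = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], 1) + self.pos
        return self.head(self.encoder(t)[:, 0])      # classify from the CLS token

logits = TinyViT()(torch.randn(2, 3, 32, 32))        # (2, 3) class scores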

6 of 38

Drawbacks: high computational cost & lack of hierarchical features

  • High computational cost (the full-length token sequence is maintained across all layers)
  • High memory consumption
  • Lack of multi-level hierarchical representations (essential for visual tasks)

Our Proposal: VT2D, a Hierarchical ViT with Bidimensional Max Pooling

  • Significant reduction of the visual token sequence length
  • Better localization of features compared to 1D pooling (see the sketch below)
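The key operation can be sketched as follows (a PyTorch snippet written under our own assumptions, not the project's code): reshape the token sequence back into its 2D patch grid, apply 2D max pooling, and flatten again, so that pooling respects the spatial layout of the patches instead of the flattened 1D order.

import torch
import torch.nn as nn

def pool_tokens_2d(tokens, grid, kernel=2):
    """Reduce a ViT token sequence with spatially-aware 2D max pooling.
    tokens: (B, N, C) with N == grid * grid."""
    B, N, C = tokens.shape
    x = tokens.transpose(1, 2).reshape(B, C, grid, grid)  # back to the 2D grid
    x = nn.functional.max_pool2d(x, kernel)               # (B, C, grid/k, grid/k)
    return x.flatten(2).transpose(1, 2)                   # (B, N/k^2, C)

t = torch.randn(2, 16, 64)                # 4x4 grid of 64-dim tokens
print(pool_tokens_2d(t, grid=4).shape)    # torch.Size([2, 4, 64])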

7 of 38

VT2D: a Hierarchical ViT with 2D Max Pooling

8 of 38

VT2D: a Hierarchical ViT with 2D Max Pooling

[Figure: VT2D input stage. Image patches → Linear Projection → Patch + Positional Embedding, without CLS token → Transformer Layer.]

9 of 38

VT2D: a Hierarchical ViT with 2D Max Pooling

[Figure: Patch + Positional Embedding (without CLS token) → Transformer Layer → 2D Max Pooling Layer.]

10 of 38

VT2D: a Hierarchical ViT with 2D Max Pooling

[Figure: full VT2D pipeline. Patch + Positional Embedding (without CLS token), then a block of Transformer Layer + 2D Max Pooling Layer repeated M times, followed by further Transformer Layers, Average Pooling, and an FC head (Dog / Cat / Bird).]

11 of 38

Investigating 2D Max Pooling at different stages

[Figure: four stages (Stage 1 to Stage 4), each made of Transformer Layers followed by a 2D Max Pooling layer that reduces the token sequence, ending with Average Pooling and an FC head (Dog / Cat / Bird).]

VT2D-4: 2D Pooling layer in all 4 stages
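A hedged sketch of how such a staged hierarchy composes (depths, dimensions, and the 16x16 input grid are our hypothetical choices; the real VT2D-4 configuration may differ):

import torch
import torch.nn as nn

class VT2DStages(nn.Module):
    """Sketch of a VT2D-4-style hierarchy: each stage runs Transformer
    layers, then 2D max pooling halves the token grid side."""
    def __init__(self, dim=64, heads=4, stages=4, layers_per_stage=3):
        super().__init__()
        make = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            layers_per_stage)
        self.stages = nn.ModuleList(make() for _ in range(stages))

    def forward(self, tokens, grid):                 # tokens: (B, grid*grid, C)
        for stage in self.stages:
            tokens = stage(tokens)
            B, N, C = tokens.shape                   # reshape to the grid and pool
            x = tokens.transpose(1, 2).reshape(B, C, grid, grid)
            x = nn.functional.max_pool2d(x, 2)
            grid //= 2
            tokens = x.flatten(2).transpose(1, 2)    # (B, N/4, C)
        return tokens.mean(1)                        # average pooling for the head

feats = VT2DStages()(torch.randn(2, 16 * 16, 64), grid=16)  # grid: 16->8->4->2->1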

12 of 38

VT2D-Small: Experimental Results on CIFAR-100


Pooling with small kernel size

  • Higher accuracy
  • Higher memory and computational cost

Pooling with large kernel size

  • Lower accuracy
  • Lower memory and computational cost
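This tradeoff follows from the roughly quadratic cost of self-attention in the number of tokens; a back-of-the-envelope sketch with a hypothetical 16x16 patch grid:

# One self-attention layer costs roughly N^2 * d for N tokens, so the
# per-layer cost ratio after one 2D pooling is (N_pooled / N)^2.
grid = 16                                  # hypothetical 16x16 patch grid
for kernel in (2, 4):
    pooled = (grid // kernel) ** 2         # tokens left after one 2D max pooling
    print(f"kernel {kernel}: {grid ** 2} -> {pooled} tokens, "
          f"~{(pooled / grid ** 2) ** 2:.4f}x attention cost per layer")
# kernel 2: 256 -> 64 tokens, ~0.0625x attention cost per layer
# kernel 4: 256 -> 16 tokens, ~0.0039x attention cost per layer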

13 of 38

Performance Comparison: Accuracy vs. FLOPs

[Plot: accuracy vs. FLOPs for VT (no pooling), VT1D, VT2D-1, VT2D-2, and VT2D-4, at 5.6 M and 21.7 M parameters.]

14 of 38

Experimental Results on ImageNet

Results confirmed on ImageNet:

  • Pooling in early stages improves accuracy and significantly reduces FLOPs
  • Pooling with small kernel size improves accuracy at the price of higher memory and computational costs

15 of 38

Next steps

Transfer to Semantic Segmentation

  • Lower computational footprint ☺

Insertion of learnable “memories” of past training items (see the sketch below)

  • Better accuracy ☺
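One plausible way to realize such learnable memories, sketched here under our own assumptions (not necessarily the design that will be adopted), is to concatenate a few trainable tokens to the patch sequence and discard them after attention:

import torch
import torch.nn as nn

class MemoryAugmentedEncoder(nn.Module):
    """Sketch: prepend K learnable memory tokens that attend together with
    the patch tokens, then drop them before the next stage."""
    def __init__(self, dim=64, heads=4, n_memories=8):
        super().__init__()
        self.memories = nn.Parameter(torch.randn(1, n_memories, dim) * 0.02)
        self.layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, C)
        k = self.memories.shape[1]
        mem = self.memories.expand(len(tokens), -1, -1)
        out = self.layer(torch.cat([mem, tokens], dim=1))
        return out[:, k:]                            # keep only the patch tokens

out = MemoryAugmentedEncoder()(torch.randn(2, 16, 64))  # (2, 16, 64)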

[Plot: accuracy vs. FLOPs, as on the previous comparison slide: VT (no pooling), VT1D, VT2D-1, VT2D-2, VT2D-4; 5.6 M and 21.7 M parameters.]

16 of 38

Semantic Segmentation

Applications:

  • Analyze objects in the urban scene and classify traffic patterns

  • Monitor road conditions and detect variations over time (using GPS location)

Limitations:

  • Limited robustness to unseen situations

  • Fixed number of classes depending on the specific training dataset

[Figure: example dashcam frame with semantic segmentation into Sky, Car, Road, Person, Vegetation.]

17 of 38

Semantic Segmentation

Transformer-based approaches for Semantic Segmentation

  • They split the input image into regular square patches → a grouping strategy that ignores semantic and perceptual similarity

  • We investigate an appearance-based grouping strategy, based on superpixels (see the sketch below)
    • Superpixels group pixels according to perceptual similarity in color space
    • A positional-encoding strategy encodes their variable shapes, both at encoding and at decoding time
  • Experimentally, we show improved results in comparison with standard backbones trained on regular patches, especially on rare classes.
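A sketch of the superpixel grouping step, assuming scikit-image's SLIC and simple average pooling of per-pixel features inside each superpixel (the positional-encoding scheme for variable shapes is not shown):

import numpy as np
from skimage.segmentation import slic

image = np.random.rand(64, 64, 3)      # stand-in RGB image in [0, 1]
features = np.random.rand(64, 64, 16)  # stand-in per-pixel feature map

labels = slic(image, n_segments=100, compactness=10)  # superpixel map (H, W)
tokens = np.stack([features[labels == s].mean(0)      # one token per superpixel
                   for s in np.unique(labels)])
print(tokens.shape)                                   # (~100, 16)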

18 of 38

Moving Towards an Open World

  • Open-Vocabulary Image Segmentation [1] exploits the alignment between image and text embeddings to recognize classes never seen during training
    • Ability to identify novel objects without annotations
    • Ability to make open-vocabulary queries on an image to highlight specific objects in the scene

[1] Ghiasi G, Gu X, Cui Y, Lin TY. Open-Vocabulary Image Segmentation. arXiv preprint arXiv:2112.12143. 2021.
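A minimal sketch of the underlying classification rule, with random tensors standing in for CLIP-like per-pixel and text embeddings (shapes and vocabulary are hypothetical):

import torch
import torch.nn.functional as F

pixel_emb = F.normalize(torch.randn(512, 64, 64), dim=0)    # (C, H, W) visual
vocab = ["road", "pothole", "traffic cone"]                 # open-vocabulary query
text_emb = F.normalize(torch.randn(len(vocab), 512), dim=1) # (K, C) textual

# Cosine similarity between every pixel and every class name, argmax per pixel.
sim = torch.einsum("kc,chw->khw", text_emb, pixel_emb)
seg = sim.argmax(0)                                         # (H, W) class indices
print({name: int((seg == i).sum()) for i, name in enumerate(vocab)})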

19 of 38

What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Raphael Tang et al. [1] introduce DAAM, a pipeline that exploits the cross-attention between word tokens and features in the Stable Diffusion denoising subnetwork to extract attention maps over the generated image.

1. Tang, Raphael, et al. "What the DAAM: Interpreting Stable Diffusion Using Cross Attention." arXiv 2022.

These attention maps can be used to inject locality information for a large vocabulary in the unsupervised setting.

Input prompt: “A woman walking past a fire hydrant.”
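Conceptually, DAAM aggregates the cross-attention that one word token receives across layers, heads, and denoising timesteps; a sketch with fake attention tensors (shapes are hypothetical, and this is not the DAAM library API):

import torch
import torch.nn.functional as F

def word_heatmap(attn_maps, token_idx, out_size=64):
    """Average the cross-attention a word token receives over a collection
    of maps, upsampling each map to a common resolution."""
    acc = 0.0
    for a in attn_maps:                  # each a: (heads, H*W, n_text_tokens)
        heads, hw, _ = a.shape
        side = int(hw ** 0.5)
        m = a[:, :, token_idx].reshape(heads, 1, side, side)
        acc = acc + F.interpolate(m, out_size, mode="bilinear").mean(0)[0]
    return acc / len(attn_maps)

maps = [torch.rand(8, s * s, 77) for s in (16, 32, 64)]  # fake UNet attentions
heat = word_heatmap(maps, token_idx=5)                   # (64, 64) heatmap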

20 of 38

FOSSIL: Reference Collection Generation through Stable Diffusion and DAAM

  1. Generate an image corresponding to a caption from COCO
  2. Parse the nouns from the caption and extract their textual embeddings
  3. Use DAAM to extract the heatmap corresponding to the noun textual token
  4. Binarize the heatmap to obtain a region proposal
  5. Embed the image with DINO
  6. Mask and pool the DINO features according to the region proposal
  7. Store the pooled visual embedding and the noun textual embeddings in the Reference Collection
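Steps 4 and 6 reduce to simple tensor operations; a sketch with random stand-ins for the DAAM heatmap and the dense DINO features (the 0.5 threshold is a hypothetical choice):

import torch

heatmap = torch.rand(64, 64)     # stand-in for a DAAM word heatmap
dino = torch.rand(384, 64, 64)   # stand-in for dense DINO features

mask = heatmap > 0.5             # step 4: binarize into a region proposal
pooled = dino[:, mask].mean(1)   # step 6: mask and average-pool the features
print(pooled.shape)              # (384,), stored in the Reference Collection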

21 of 38

FOSSIL: Inference for Open-Vocabulary Segmentation

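As a rough illustration of how a Reference Collection can drive open-vocabulary inference, the sketch below matches each pixel embedding against the collection and inherits the noun label of the closest entry (a nearest-neighbor assumption on our part, not necessarily the exact FOSSIL procedure):

import torch
import torch.nn.functional as F

pixels = F.normalize(torch.randn(384, 64, 64), dim=0)    # (C, H, W) DINO-like
collection = F.normalize(torch.randn(1000, 384), dim=1)  # (M, C) stored embeddings
labels = torch.randint(0, 20, (1000,))                   # noun id of each entry

sim = torch.einsum("mc,chw->mhw", collection, pixels)    # cosine similarities
seg = labels[sim.argmax(0)]                              # (H, W) nearest noun map
print(seg.shape)                                         # torch.Size([64, 64])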

22 of 38

Comparison with state-of-the-art

FOSSIL achieves a new state of the art on 4 segmentation datasets!

23 of 38

Examples of generated images, DAAM heatmaps, and OpenCut masks

[Figure: generated images with DAAM heatmaps and OpenCut masks for the prompts “A woman on a bicycle is checking her phone” (woman, bicycle, phone), “A man and a dog sitting by a lake” (man, dog, lake), and “A soda and a hotdog are on the table” (soda, hotdog, table).]

24 of 38

Impact of OpenCut: qualitative examples

[Figure: columns show Original Image, Ground Truth, FOSSIL w/o OpenCut, FOSSIL with OpenCut.]

25 of 38

Impact of OpenCut: qualitative examples

[Figure: further examples with the same columns: Original Image, Ground Truth, FOSSIL w/o OpenCut, FOSSIL with OpenCut.]

26 of 38

Event Classification and Anomaly Detection

27 of 38

Event Classification

The goal is an advanced system for recognizing dangerous situations, so that the driver is able to:

  • avoid the collision, if not yet directly involved;
  • minimize the damage (e.g. protecting pedestrians);
  • widely disseminate information about anomalies among interconnected devices [1].

This means building a Real-time Proactive Online System.

Image by macrovector on Freepik

[1] Fatemidokht, Hamideh, et al. "Efficient and secure routing protocol based on artificial intelligence algorithms with UAV-assisted for vehicular ad hoc networks in intelligent transportation systems." IEEE Transactions on Intelligent Transportation Systems 22.7 (2021): 4757-4769.

28 of 38

Event Classification

Real-time: its response times must be short and predictable.

Proactive: it does not wait for the accident to involve the vehicle itself, but reacts first.

Online: we do not know the future, only the present and, in a limited way, the past.

Image by macrovector on Freepik

29 of 38

Detection of Traffic Anomaly (DoTA) Dataset

Information available during the model training [1]:

  • dashcam video
  • anomalous frames
  • fixation information (replaceable with the location of the subjects involved)

[1] Yao, Yu, et al. "DoTA: Unsupervised Detection of Traffic Anomaly in Driving Videos." IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).

30 of 38

Novel Architecture

Four main blocks:

  1. a short-term memory module to encode the information related to what is happening in the present and the near past;
  2. a long-term memory module to keep track of the remote past;
  3. a saliency module to enhance the relevant information in the current frame scene;
  4. the final classification (head) module.
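A hedged PyTorch sketch of how these four blocks could fit together (sizes and layer choices are our assumptions, not the actual model):

import torch
import torch.nn as nn

class AnomalyClassifier(nn.Module):
    """Sketch of the four-block design: (1) short-term memory over the last
    frames, (2) a recurrent long-term memory, (3) a saliency gate on the
    current frame, (4) a classification head."""
    def __init__(self, feat=256, classes=2):
        super().__init__()
        self.backbone = nn.Sequential(           # toy per-frame encoder
            nn.Conv2d(3, feat, 7, 4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.short = nn.GRU(feat, feat, batch_first=True)  # 1) near past
        self.long = nn.GRUCell(feat, feat)                 # 2) remote past
        self.saliency = nn.Sequential(nn.Linear(feat, feat), nn.Sigmoid())  # 3)
        self.head = nn.Linear(2 * feat, classes)           # 4) classification

    def forward(self, frames, long_state):       # frames: (B, T=3, 3, H, W)
        B, T = frames.shape[:2]
        f = self.backbone(frames.flatten(0, 1)).reshape(B, T, -1)
        _, h = self.short(f)                     # summary of the last 3 frames
        long_state = self.long(h[0], long_state) # small latent state of the past
        cur = f[:, -1] * self.saliency(f[:, -1]) # gate the current frame
        return self.head(torch.cat([cur, long_state], 1)), long_state

model = AnomalyClassifier()
state = torch.zeros(2, 256)
logits, state = model(torch.randn(2, 3, 3, 64, 64), state)  # 3 frames at a time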

31 of 38

Novel Architecture

Real-time: the model computation is reduced as much as possible, and the required execution time is fixed.

Proactive: by analyzing the dangerousness of what is happening outside the car, future applications could give precious information to the driver and to those around them.

Online: it processes three frames at a time, plus a small latent state of the past.

32 of 38

Roadster Demo Interface

33 of 38

Roadster Demo Interface

  • Dash camera videos with GPS coordinates

34 of 38

Roadster Demo Interface

  • Dash camera videos with GPS coordinates
  • Visualization of semantic segmentation results, according to different visual backbones

35 of 38

Roadster Demo Interface

  • Dash camera videos with GPS coordinates
  • Visualization of semantic segmentation results, according to different visual backbones
  • Prediction of traffic intensity and traffic composition

36 of 38

Roadster Demo Interface

  • Dash camera videos with GPS coordinates
  • Visualization of semantic segmentation results, according to different visual backbones
  • Traffic tracking and re-identification
  • Prediction of traffic intensity and traffic composition

37 of 38

Roadster Demo Interface

  • Dash camera videos with GPS coordinates
  • Synchronization between input videos and Google Maps or Google Street View, through GPS coordinates and orientation estimation (see the sketch below)
  • Visualization of semantic segmentation results, according to different visual backbones
  • Traffic tracking and re-identification
  • Prediction of traffic intensity and traffic composition
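For the orientation estimation used in the Street View synchronization, one simple option is to derive the heading from two consecutive GPS fixes with the standard forward-azimuth formula (the coordinates below are hypothetical):

import math

def bearing(lat1, lon1, lat2, lon2):
    """Forward azimuth in degrees from north between two GPS fixes, usable
    to orient a street-level panorama along the direction of travel."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

print(round(bearing(44.6471, 10.9252, 44.6475, 10.9260), 1))  # heading in degrees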

38 of 38

Thank you!

Questions?