2 of 59

Background

04.02.2026

2016 - 2020

Xi’an Jiaotong University

Bachelor

2020 - 2022

Aalto University

Master

2023 - 2024

EPFL

Visiting Researcher

2022 - 2025

ETH Zurich

Ph.D.

3 of 59

Research Interests

Multimodal Learning

Generalization & Adaptation

Reliable and Trustworthy Machine Learning

Robotics and Autonomous Driving

4 of 59

Machine learning has made significant progress

Action Recognition

Image Classification

Semantic Segmentation

Anomaly Detection

5 of 59

Machine learning can also fail disastrously

6 of 59

A reliable model should

Generalize to different distribution shifts

Night

Rain

Fog

Snow

7 of 59

A reliable model should

Detect out-of-distribution objects

8 of 59

A reliable model should

Know when to trust its predictions

Can I trust my model’s prediction?

“dog”

wrong prediction!

9 of 59

A reliable model should

Integrate multiple modalities

Image

LiDAR

Video

Optical Flow

Audio

Thermal

10 of 59

Proposed Framework

Improving Model Reliability

in Multimodal Settings

11 of 59

Multimodal Generalization and Adaptation

Improving Model Reliability

in Multimodal Settings

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Learning a multimodal model from source domains to either generalize to any unseen domain or adapt to a specific new target domain.

unseen

source domains

generalize

12 of 59

Multimodal Out-of-Distribution Detection

Improving Model Reliability

in Multimodal Settings

Detecting inputs that are semantically novel by leveraging information from multiple modalities

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

13 of 59

Multimodal Misclassification Detection

Improving Model Reliability

in Multimodal Settings

Multimodal Misclassification Detection

Detecting incorrect predictions by leveraging cross-modal cues to estimate prediction confidence

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

“dog”

wrong prediction!

arXiv 2025

14 of 59

Outlier Synthesis for Robust Learning

Improving Model Reliability

in Multimodal Settings

Multimodal Misclassification Detection

Outlier Synthesis for Robust Learning

TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025

Intentionally generating outliers and exposing the model to them to improve its robustness

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

arXiv 2025

15 of 59

Proposed Framework

Improving Model Reliability

in Multimodal Settings

Multimodal Misclassification Detection

Outlier Synthesis for Robust Learning

TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025

This talk

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

arXiv 2025

16 of 59

Traditional Domain Generalization Setup

Unavailable during training

source domains

target domain

17 of 59

Multimodal Domain Generalization

source domain 1

source domain 2

target domain

modality 1: video

modality 2: optical flow

modality 3: audio

18 of 59

Multimodal Domain Generalization - Motivation

Videos are from our newly introduced Human-Animal-Cartoon dataset

19 of 59

Multimodal Domain Generalization - Key Idea

Dong, Nejjar, Sun, Chatzi, Fink, “SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization”, NeurIPS 2023

20 of 59

SimMMDG Framework

21 of 59

SimMMDG Framework

22 of 59

SimMMDG Framework

23 of 59

SimMMDG Framework

24 of 59

SimMMDG Framework

25 of 59

Multimodal DG

26 of 59

Multimodal DG

27 of 59

Missing-modality DG

28 of 59

Missing-modality DG

Key takeaway: Decoupling features into modality-specific and shared components can preserve distinct information to improve generalization
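
A minimal sketch of what such a decoupling could look like, assuming each modality's embedding is simply split by index into a shared part and a modality-specific part. The split rule, the helper names `split_embedding` and `squared_distance`, and the toy values are illustrative assumptions, not the SimMMDG implementation.

```python
def split_embedding(embedding, shared_dim):
    """Split one modality's embedding into (shared, specific) parts."""
    if not 0 < shared_dim < len(embedding):
        raise ValueError("shared_dim must leave both parts non-empty")
    return embedding[:shared_dim], embedding[shared_dim:]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy embeddings of the same clip from two modalities (made-up numbers).
video = [0.20, 0.90, -0.40, 0.10]
audio = [0.25, 0.85, 0.70, -0.30]

video_shared, video_specific = split_embedding(video, shared_dim=2)
audio_shared, audio_specific = split_embedding(audio, shared_dim=2)

# Shared parts are meant to align across modalities; specific parts may differ.
shared_gap = squared_distance(video_shared, audio_shared)
specific_gap = squared_distance(video_specific, audio_specific)
```

In this toy case the shared parts sit close together while the specific parts stay far apart, which is the kind of separation the takeaway describes.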

29 of 59

Follow-up Works

Dong, Chatzi, Fink, “Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision”, ECCV 2024

Address both distribution and label shifts

Dong, Chatzi, Fink, “Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization”, ICLR 2025

Adapt the model online

30 of 59

Follow-up Work: A Comprehensive Survey

Dong et al., “Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models”, TPAMI 2026

31 of 59

Out-of-Distribution Detection

32 of 59

Out-of-Distribution Detection - Ideally

Classes: “dog”, “cat”, “fish”

Predicted probabilities: 0.8, 0.15, 0.05

Classes: “dog”, “cat”, “fish”

Predicted probabilities: 0.4, 0.3
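
The ideal behavior on this slide can be sketched with a maximum-softmax-probability rule: a confident prediction is kept, a flat one is flagged as out-of-distribution. This is only an illustration under assumptions: the slide does not name a scoring rule, the `threshold` of 0.5 is hypothetical, and the `flat` vector is made up (only `[0.8, 0.15, 0.05]` comes from the slide).

```python
def max_softmax_probability(probs):
    """Confidence score: the largest class probability."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return max(probs)


def is_out_of_distribution(probs, threshold=0.5):
    """Flag an input as OOD when the model's confidence is below threshold."""
    return max_softmax_probability(probs) < threshold


confident = [0.8, 0.15, 0.05]  # values shown on the slide
flat = [0.34, 0.33, 0.33]      # hypothetical low-confidence prediction

confident_flag = is_out_of_distribution(confident)
flat_flag = is_out_of_distribution(flat)
```

A rule like this only works if the model is actually uncertain on OOD inputs; an overconfident model assigns a high maximum probability to them as well and slips past the threshold.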

33 of 59

Out-of-Distribution Detection - Overconfidence

Classes: “dog”, “cat”, “fish”

Predicted probabilities: 0.7, 0.25, 0.05

Classes: “dog”, “cat”, “fish”

Predicted probabilities: 0.1, 0.8

34 of 59

Out-of-Distribution Detection - Why Multimodal?

35 of 59

The first benchmark for Multimodal OOD Detection

Dong, Zhao, Chatzi, Fink, “MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities”, NeurIPS 2024 spotlight

MultiOOD Benchmark

36 of 59

The predictions from video and optical flow show uniformity across ID data

and exhibit variability across OOD data.

Modality Prediction Discrepancy
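
One way to make this discrepancy concrete is to measure the distance between the two modalities' softmax outputs. The sketch below assumes an L1 distance and uses made-up logits; the slides do not specify the distance measure, so treat `prediction_discrepancy` as an illustrative helper rather than the benchmark's definition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_discrepancy(logits_video, logits_flow):
    """L1 distance between the two modalities' class probabilities."""
    p_video = softmax(logits_video)
    p_flow = softmax(logits_flow)
    return sum(abs(a - b) for a, b in zip(p_video, p_flow))


# Hypothetical ID-like sample: both modalities favor the same class.
agreeing = prediction_discrepancy([4.0, 1.0, 0.0], [3.5, 0.5, 0.2])
# Hypothetical OOD-like sample: the modalities favor different classes.
conflicting = prediction_discrepancy([4.0, 1.0, 0.0], [0.0, 1.0, 4.0])
```

With this measure the value ranges from 0 (identical predictions) to 2 (completely disjoint predictions), so a larger value signals stronger disagreement between modalities.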

37 of 59

The prediction discrepancies are highly correlated with the final OOD detection performance.

Modality Prediction Discrepancy

38 of 59

Different modalities should Agree on the prediction regarding the ground-truth class,

while Disagree on the remaining classes by maximizing their prediction distance.

Agree-to-Disagree (A2D) Algorithm
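
The agree/disagree objective can be sketched as a per-sample loss for two modalities. This is a minimal sketch under assumptions: cross-entropy stands in for the "agree" term, an L1 distance over the non-ground-truth classes stands in for the "disagree" term, and `weight` is a hypothetical trade-off parameter. The slides do not give the exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2d_style_loss(logits_a, logits_b, label, weight=1.0):
    """Return (loss, discrepancy) for one sample and two modalities."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    # Agree: both modalities are pushed toward the ground-truth class.
    agree = -math.log(p_a[label]) - math.log(p_b[label])
    # Disagree: compare the predictions on all remaining classes.
    rest_a = [p for i, p in enumerate(p_a) if i != label]
    rest_b = [p for i, p in enumerate(p_b) if i != label]
    discrepancy = sum(abs(x - y) for x, y in zip(rest_a, rest_b))
    # Minimizing the loss maximizes the discrepancy term.
    return agree - weight * discrepancy, discrepancy


# Same confidence in the ground-truth class (index 0) in both cases.
same_loss, same_gap = a2d_style_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], label=0)
diff_loss, diff_gap = a2d_style_loss([2.0, 1.0, 0.0], [2.0, 0.0, 1.0], label=0)
```

Both cases are equally confident in the ground-truth class, yet the second one gets the lower loss because the modalities spread their remaining probability mass differently.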

39 of 59

A2D training amplifies Modality Prediction Discrepancy, enhancing the efficacy of OOD detection.

Agree-to-Disagree (A2D) Algorithm

40 of 59

Synthesizing outliers capable of spanning wider embedding spaces by leveraging the information from nearest neighbor classes

Nearest Neighbor Prototype-based Mixup (NP-Mix)
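
A rough sketch of prototype-based mixup with nearest neighbor classes is given below. It assumes class prototypes are mean embeddings, neighbors are ranked by Euclidean distance, and the mixing coefficient `lam` is passed in explicitly; all function names and the toy data are illustrative, and the actual NP-Mix procedure may differ in these details.

```python
import math


def class_prototypes(embeddings, labels):
    """Average the embeddings of each class into one prototype per class."""
    grouped = {}
    for vector, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def nearest_classes(prototypes, label, k):
    """Return the k classes whose prototypes lie closest to `label`'s."""
    others = [c for c in prototypes if c != label]
    others.sort(key=lambda c: math.dist(prototypes[label], prototypes[c]))
    return others[:k]


def np_mix_style_outlier(prototypes, label, neighbor, lam):
    """Mix a class prototype with a neighboring class prototype."""
    own, other = prototypes[label], prototypes[neighbor]
    return [lam * a + (1.0 - lam) * b for a, b in zip(own, other)]


# Toy 2-D embeddings for three classes (made-up numbers).
embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 1.0], [4.0, 1.0], [10.0, 10.0]]
labels = [0, 0, 1, 1, 2]

prototypes = class_prototypes(embeddings, labels)
neighbor = nearest_classes(prototypes, 0, k=1)[0]
outlier = np_mix_style_outlier(prototypes, 0, neighbor, lam=0.5)
```

The synthesized point lands between class 0 and its closest class, in a region that belongs to neither, which is where an outlier is most informative for the classifier. In practice the mixing coefficient would typically be sampled rather than fixed, as in standard mixup.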

41 of 59

Visualization of Synthesized Outliers

(a) VOS

(b) NPOS

(d) NP-Mix (ours)

42 of 59

Multimodal OOD Detection on HMDB51

43 of 59

Score Distributions

(a) Energy

(b) LogitNorm

(d) LogitNorm++

44 of 59

Score Distributions

Key takeaway: Maximizing prediction discrepancies across modalities significantly improves OOD performance

(a) Energy

(b) LogitNorm

(d) LogitNorm++

45 of 59

Follow-up Work: Dynamic Prototype Updating

Li, Gong, Dong, Yang, Tu, Zhao, “DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection”, CVPR 2025 highlight

Amplify modality prediction discrepancies proportionally to the sample’s distance from its class prototype.

46 of 59

Multimodal Misclassification Detection

Wrong prediction!

Classes: bird, dog, cat, plane

A photo of a [CLASS]

A photo of a dog

Text Encoder

Image Encoder

47 of 59

Multimodal Misclassification Detection - Ideally

Wrong prediction!

Classes: bird, dog, cat, plane

A photo of a [CLASS]

A photo of a dog

Text Encoder

Image Encoder

Classes: “dog”, “cat”, “bird”

Predicted probabilities: 0.8, 0.15, 0.05

Classes: “dog”, “cat”, “bird”

Predicted probabilities: 0.3, 0.45, 0.25

48 of 59

Multimodal Misclassification Detection - Overconfidence

Wrong prediction!

Classes: bird, dog, cat, plane

A photo of a [CLASS]

A photo of a dog

Text Encoder

Image Encoder

Classes: “dog”, “cat”, “bird”

Predicted probabilities: 0.75, 0.15, 0.1

Classes: “dog”, “cat”, “bird”

Predicted probabilities: 0.15, 0.8, 0.05

49 of 59

Motivation

50 of 59

Motivation

51 of 59

TrustVLM framework

Dong, Liu, Liang, Chatzi, Fink, “To Trust Or Not To Trust Your Vision-Language Model’s Prediction”, arXiv 2025

52 of 59

TrustVLM framework

53 of 59

TrustVLM framework

54 of 59

Illustration of TrustVLM’s Mechanism

55 of 59

Illustration of TrustVLM’s Mechanism

56 of 59

Illustration of TrustVLM’s Mechanism

Key takeaway: Image-to-image similarity provides an essential complement to image-to-text similarity
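
To illustrate how the two similarities could complement each other, the sketch below adds an image-to-image term to the image-to-text confidence of the predicted class. It is a hedged illustration only: the additive combination, the cosine similarity to a per-class visual prototype, and the names `combined_confidence` and `dog_prototype` are assumptions, not the TrustVLM scoring function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combined_confidence(text_prob, image_embedding, class_prototype):
    """Image-to-text probability plus image-to-image similarity."""
    return text_prob + cosine(image_embedding, class_prototype)


# Hypothetical visual prototype of the predicted class "dog".
dog_prototype = [1.0, 0.0]

# Both images get the same image-to-text probability for "dog" (0.75),
# but only the first one actually looks like the stored dog prototype.
consistent = combined_confidence(0.75, [1.0, 0.0], dog_prototype)
suspicious = combined_confidence(0.75, [0.0, 1.0], dog_prototype)
```

The image-to-text score alone cannot separate the two cases; the visual term lowers the confidence of the prediction that does not resemble other images of its predicted class.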

57 of 59

Outlier Synthesis for Robust Learning

Dong, Frusque, Zhao, Chatzi, Fink, “NNG-Mix: Improving Semi-supervised Anomaly Detection with Pseudo-anomaly Generation”, TNNLS 2024

Nejjar, Dong, Fink, “Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework”, TMLR 2025

Liu*, Dong*, Kelly, Fink, Trapp, “Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation”, NeurIPS 2025

Sun, Cao, Dong, Fink, “Unseen Visual Anomaly Generation”, CVPR 2025

58 of 59

Conclusion

Improving Model Reliability

in Multimodal Settings

Multimodal Misclassification Detection

arXiv 2025

Outlier Synthesis for Robust Learning

TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight
