1 of 59

Towards Reliable Multimodal Learning for Real-world Systems

Hao Dong

2 of 59

Background

04.02.2026

2016 - 2020

Xi’an Jiaotong University

Bachelor

2020 - 2022

Aalto University

Master

2023 - 2024

EPFL

Visiting Researcher

2022 - 2025

ETH Zurich

Ph.D.

3 of 59

Research Interests

Multimodal Learning

Generalization & Adaptation

Reliable and Trustworthy Machine Learning

Robotics and Autonomous Driving

4 of 59

Machine learning has made significant progress

Action Recognition

Image Classification

Semantic Segmentation

Anomaly Detection

5 of 59

Machine learning can also fail disastrously

6 of 59

A reliable model should

  • Generalize to different distribution shifts

Night

Rain

Fog

Snow

7 of 59

A reliable model should

  • Detect out-of-distribution objects

8 of 59

A reliable model should

  • Know when to trust its predictions

Can I trust my model’s prediction?

“dog”

wrong prediction!

9 of 59

A reliable model should

  • Integrate multiple modalities

Image

LiDAR

Video

Optical Flow

Audio

Thermal

10 of 59

Proposed Framework

Improving Model Reliability

in Multimodal Settings

11 of 59

Multimodal Generalization and Adaptation

Improving Model Reliability

in Multimodal Settings

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Learning a multimodal model from source domains to either generalize to any unseen domain or adapt to a specific new target domain.

unseen

source domains

generalize

12 of 59

Multimodal Out-of-Distribution Detection

Improving Model Reliability

in Multimodal Settings

Detecting inputs that are semantically novel by leveraging information from multiple modalities

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

13 of 59

Multimodal Misclassification Detection

Improving Model Reliability

in Multimodal Settings

Multimodal Misclassification Detection

Detecting incorrect predictions by leveraging cross-modal cues to estimate prediction confidence

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

“dog”

wrong prediction!

arXiv 2025

14 of 59

Outlier Synthesis for Robust Learning

Improving Model Reliability

in Multimodal Settings

Multimodal Misclassification Detection

Outlier Synthesis for Robust Learning

TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025

Intentionally generating outliers and exposing the model to them to improve its robustness

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

arXiv 2025

15 of 59

Proposed Framework

Improving Model Reliability

in Multimodal Settings

Multimodal Misclassification Detection

Outlier Synthesis for Robust Learning

TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025

This talk

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

arXiv 2025

16 of 59

Traditional Domain Generalization Setup

Unavailable during training

source domains

target domain

17 of 59

Multimodal Domain Generalization

source domain 1

source domain 2

target domain

modality 1: video

modality 2: optical flow

modality 3: audio

18 of 59

Multimodal Domain Generalization - Motivation

Videos are from our newly introduced Human-Animal-Cartoon dataset

19 of 59

Multimodal Domain Generalization - Key Idea

Dong, Nejjar, Sun, Chatzi, Fink, “SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization”, NeurIPS 2023

20 of 59

SimMMDG Framework

21 of 59

SimMMDG Framework

22 of 59

SimMMDG Framework

23 of 59

SimMMDG Framework

24 of 59

SimMMDG Framework

25 of 59

Multimodal DG

26 of 59

Multimodal DG

27 of 59

Missing-modality DG

28 of 59

Missing-modality DG

Key takeaway: Decoupling features into modality-specific and shared components can preserve distinct information to improve generalization
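
As a rough illustration of this takeaway, the sketch below splits each modality's embedding into a shared part and a modality-specific part, then measures how well each part aligns across modalities. The half-and-half split, the cosine-similarity measure, the helper names, and the toy embeddings are all assumptions made for illustration; the slides do not spell out SimMMDG's actual losses.

```python
import math


def split_embedding(embedding):
    """Split a feature vector into (shared, specific) halves (assumed layout)."""
    half = len(embedding) // 2
    return embedding[:half], embedding[half:]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of the same clip from two modalities.
video_embedding = [0.9, 0.1, 0.3, -0.7]
audio_embedding = [0.8, 0.2, -0.5, 0.4]

video_shared, video_specific = split_embedding(video_embedding)
audio_shared, audio_specific = split_embedding(audio_embedding)

# Shared parts are meant to align across modalities;
# specific parts are free to carry distinct information.
shared_alignment = cosine_similarity(video_shared, audio_shared)
specific_alignment = cosine_similarity(video_specific, audio_specific)

print(f"shared alignment:   {shared_alignment:.3f}")
print(f"specific alignment: {specific_alignment:.3f}")
```

In this toy case the shared halves point in nearly the same direction while the specific halves do not, which is the kind of separation the takeaway describes.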

29 of 59

Follow-up Work

Dong, Chatzi, Fink, “Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision”, ECCV 2024

Address both distribution and label shifts

Dong, Chatzi, Fink, “Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization”, ICLR 2025

Adapt the model online

30 of 59

Follow-up Work: A Comprehensive Survey

Dong et al., “Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models”, TPAMI 2026

31 of 59

Out-of-Distribution Detection

32 of 59

Out-of-Distribution Detection - Ideally

Predicted probabilities, first example: “dog” 0.8, “cat” 0.15, “fish” 0.05

Predicted probabilities, second example: “dog” 0.4, “cat” 0.3, “fish” 0.3

33 of 59

Out-of-Distribution Detection - Overconfidence

Predicted probabilities, first example: “dog” 0.7, “cat” 0.25, “fish” 0.05

Predicted probabilities, second example: “dog” 0.1, “cat” 0.1, “fish” 0.8
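
To make the contrast between the ideal and the overconfident case concrete, the sketch below scores each probability vector shown on these two slides with the maximum softmax probability, a common baseline confidence score. The scoring rule, the 0.5 threshold, and the reading of the second example on each slide as the out-of-distribution input are assumptions for illustration, not something the slides state.

```python
def max_softmax_score(probabilities):
    """Confidence score: the largest class probability."""
    return max(probabilities.values())


def is_flagged_ood(probabilities, threshold=0.5):
    """Flag an input as OOD when its confidence falls below the threshold."""
    return max_softmax_score(probabilities) < threshold


# Probabilities as shown on the "Ideally" slide.
ideal_first = {"dog": 0.8, "cat": 0.15, "fish": 0.05}
ideal_second = {"dog": 0.4, "cat": 0.3, "fish": 0.3}  # assumed OOD input

# Probabilities as shown on the "Overconfidence" slide.
overconfident_first = {"dog": 0.7, "cat": 0.25, "fish": 0.05}
overconfident_second = {"dog": 0.1, "cat": 0.1, "fish": 0.8}  # assumed OOD input

for name, probs in [
    ("ideal, first", ideal_first),
    ("ideal, second", ideal_second),
    ("overconfident, first", overconfident_first),
    ("overconfident, second", overconfident_second),
]:
    print(name, max_softmax_score(probs), is_flagged_ood(probs))
```

Under these assumptions the near-uniform second example of the ideal case is flagged, while the overconfident second example passes the threshold with a score of 0.8, which is the failure mode the slide title refers to.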

34 of 59

Out-of-Distribution Detection - Why Multimodal?

35 of 59

The first benchmark for Multimodal OOD Detection

Dong, Zhao, Chatzi, Fink, “MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities”, NeurIPS 2024 spotlight

MultiOOD Benchmark

36 of 59

Modality Prediction Discrepancy

The predictions from video and optical flow agree closely on ID data and diverge on OOD data.
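
One simple way to quantify this discrepancy is a distance between the two modalities' predicted probability vectors. The sketch below uses the L1 distance; the choice of distance and all the probability values are assumptions for illustration, since the slide does not name a specific measure.

```python
def l1_discrepancy(probs_a, probs_b):
    """L1 distance between two predicted probability vectors."""
    return sum(abs(a - b) for a, b in zip(probs_a, probs_b))


# Hypothetical predictions for an in-distribution sample:
# both modalities tell roughly the same story.
id_video = [0.80, 0.15, 0.05]
id_flow = [0.75, 0.20, 0.05]

# Hypothetical predictions for an out-of-distribution sample:
# the modalities disagree on where the probability mass goes.
ood_video = [0.60, 0.30, 0.10]
ood_flow = [0.20, 0.30, 0.50]

id_discrepancy = l1_discrepancy(id_video, id_flow)
ood_discrepancy = l1_discrepancy(ood_video, ood_flow)

print(f"ID discrepancy:  {id_discrepancy:.2f}")
print(f"OOD discrepancy: {ood_discrepancy:.2f}")
```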

37 of 59

Modality Prediction Discrepancy

The prediction discrepancy is highly correlated with the final OOD detection performance.

38 of 59

Agree-to-Disagree (A2D) Algorithm

Different modalities should Agree on the prediction for the ground-truth class and Disagree on the remaining classes, where disagreement is encouraged by maximizing their prediction distance.
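
The agree/disagree idea can be sketched as a two-term objective: a cross-entropy term per modality for the ground-truth class, minus a distance between the modalities' predictions over the remaining classes. The renormalization over non-target classes, the L1 distance, the `weight` parameter, and the function names are assumptions for illustration and may differ from the published A2D formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def non_target_distribution(probs, target):
    """Drop the ground-truth class and renormalize the rest."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]


def a2d_style_loss(video_logits, flow_logits, target, weight=1.0):
    """Return (loss, agree_term, disagree_term) for one sample."""
    video_probs = softmax(video_logits)
    flow_probs = softmax(flow_logits)
    # Agree: both modalities should put mass on the ground-truth class.
    agree = -math.log(video_probs[target]) - math.log(flow_probs[target])
    # Disagree: distance between the predictions on the other classes.
    video_rest = non_target_distribution(video_probs, target)
    flow_rest = non_target_distribution(flow_probs, target)
    disagree = sum(abs(a - b) for a, b in zip(video_rest, flow_rest))
    # Minimizing the loss maximizes the disagreement term.
    return agree - weight * disagree, agree, disagree


video_logits = [3.0, 1.0, 0.0]
flow_same = [3.0, 1.0, 0.0]  # identical non-target ranking
flow_diff = [3.0, 0.0, 1.0]  # same target logit, different non-target ranking

loss_same, agree_same, disagree_same = a2d_style_loss(video_logits, flow_same, target=0)
loss_diff, agree_diff, disagree_diff = a2d_style_loss(video_logits, flow_diff, target=0)
print(loss_same, loss_diff)
```

Both flow predictions are equally confident in the ground-truth class, so the agree term is the same; the loss is lower only for the pair that disagrees on the remaining classes.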

39 of 59

Agree-to-Disagree (A2D) Algorithm

A2D training amplifies Modality Prediction Discrepancy, enhancing the efficacy of OOD detection.

40 of 59

Nearest Neighbor Prototype-based Mixup (NP-Mix)

Synthesizing outliers capable of spanning wider embedding spaces by leveraging the information from nearest neighbor classes
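
A minimal sketch of prototype-based mixing, under stated assumptions: class prototypes are taken as mean embeddings, the nearest other class is found by Euclidean distance, and the outlier is a convex combination of the sample and that prototype. The fixed mixing coefficient, the distance choice, the helper names, and the toy 2-D embeddings are all hypothetical; the slide only says that information from nearest neighbor classes is used.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_other_prototype(sample, label, prototypes):
    """Return (class, prototype) of the closest class other than `label`."""
    others = {c: p for c, p in prototypes.items() if c != label}
    nearest = min(others, key=lambda c: euclidean(sample, others[c]))
    return nearest, others[nearest]


def mix(sample, prototype, lam):
    """Convex combination of a sample and a prototype."""
    return [lam * s + (1.0 - lam) * p for s, p in zip(sample, prototype)]


# Hypothetical 2-D embeddings for three classes.
embeddings = {
    "dog": [[1.0, 0.0], [1.2, 0.2]],
    "cat": [[0.0, 1.0], [0.2, 1.2]],
    "fish": [[-3.0, -3.0], [-3.2, -2.8]],
}
prototypes = {c: mean_vector(vs) for c, vs in embeddings.items()}

sample, label = [1.0, 0.0], "dog"
nearest_class, nearest_prototype = nearest_other_prototype(sample, label, prototypes)
outlier = mix(sample, nearest_prototype, lam=0.5)
print(nearest_class, outlier)
```

The synthesized point lies between the sample and its nearest neighboring class, in a region that belongs to neither class cluster.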

41 of 59

Visualization of Synthesized Outliers

Figure panels: (a) VOS, (b) NPOS, (c) Mixup, (d) NP-Mix (ours)

42 of 59

Multimodal OOD Detection on HMDB51

43 of 59

Score Distributions

Figure panels: (a) Energy, (b) LogitNorm, (c) Energy++, (d) LogitNorm++

44 of 59

Score Distributions

Key takeaway: Maximizing prediction discrepancies across modalities significantly improves OOD performance

Figure panels: (a) Energy, (b) LogitNorm, (c) Energy++, (d) LogitNorm++

45 of 59

Follow-up Work: Dynamic Prototype Updating

Li, Gong, Dong, Yang, Tu, Zhao, “DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection”, CVPR 2025 highlight

Amplify modality prediction discrepancies proportionally to the sample’s distance from its class prototype.
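
One way to read "proportionally to the sample's distance from its class prototype" is as a per-sample weight on the discrepancy term. The sketch below computes such weights by normalizing each sample's Euclidean distance by the largest distance in the batch; the normalization, the distance choice, and the function names are assumptions for illustration and are not taken from the DPU paper.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def discrepancy_weights(embeddings, prototype):
    """Weight each sample by its distance to the class prototype, scaled to [0, 1]."""
    distances = [euclidean(e, prototype) for e in embeddings]
    largest = max(distances)
    if largest == 0.0:
        return [0.0 for _ in distances]
    return [d / largest for d in distances]


# Hypothetical class prototype and two samples of that class.
prototype = [0.0, 0.0]
batch = [[0.1, 0.0], [3.0, 4.0]]  # one near the prototype, one far from it

weights = discrepancy_weights(batch, prototype)
print(weights)
```

Samples close to the prototype receive a small weight, so their discrepancy is amplified only slightly, while samples far from it receive a weight close to 1.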

46 of 59

Multimodal Misclassification Detection

Figure labels: Image Encoder; Text Encoder; prompt template “A photo of a [CLASS]” with candidate classes bird, dog, cat, plane; output “A photo of a dog”, marked “Wrong Prediction!”

47 of 59

Multimodal Misclassification Detection - Ideally

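Figure labels: Image Encoder; Text Encoder; prompt template “A photo of a [CLASS]” with candidate classes bird, dog, cat, plane; output “A photo of a dog”, marked “Wrong Prediction!”

Predicted probabilities, first example: “dog” 0.8, “cat” 0.15, “bird” 0.05

Predicted probabilities, second example: “dog” 0.3, “cat” 0.45, “bird” 0.25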

48 of 59

Multimodal Misclassification Detection - Overconfidence

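Figure labels: Image Encoder; Text Encoder; prompt template “A photo of a [CLASS]” with candidate classes bird, dog, cat, plane; output “A photo of a dog”, marked “Wrong Prediction!”

Predicted probabilities, first example: “dog” 0.75, “cat” 0.15, “bird” 0.1

Predicted probabilities, second example: “dog” 0.15, “cat” 0.8, “bird” 0.05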

49 of 59

Motivation

50 of 59

Motivation

51 of 59

TrustVLM framework

Dong, Liu, Liang, Chatzi, Fink, “To Trust Or Not To Trust Your Vision-Language Model’s Prediction”, arXiv 2025

52 of 59

TrustVLM framework

53 of 59

TrustVLM framework

54 of 59

Illustration of TrustVLM’s Mechanism

55 of 59

Illustration of TrustVLM’s Mechanism

56 of 59

Illustration of TrustVLM’s Mechanism

Key takeaway: Image-to-image similarity provides an essential complement to image-to-text similarity
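
A rough sketch of how the two similarities could be combined into one confidence score: the image-to-text term is the softmax probability of the predicted class over prompt similarities, and the image-to-image term is the cosine similarity between the image's visual embedding and a stored visual prototype of the predicted class. Summing the two terms, keeping a separate visual embedding, the prototype dictionary, and every name and value below are assumptions for illustration rather than the TrustVLM implementation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def softmax(values):
    peak = max(values)
    exps = [math.exp(x - peak) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(joint_embedding, visual_embedding, text_embeddings, visual_prototypes):
    """Return (predicted class, combined confidence score)."""
    classes = list(text_embeddings)
    sims = [cosine(joint_embedding, text_embeddings[c]) for c in classes]
    probs = softmax(sims)
    best = max(range(len(classes)), key=lambda i: probs[i])
    predicted = classes[best]
    image_to_text = probs[best]
    image_to_image = cosine(visual_embedding, visual_prototypes[predicted])
    return predicted, image_to_text + image_to_image


# Hypothetical prompt embeddings and visual class prototypes.
text_embeddings = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
visual_prototypes = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}

# Two images with the same image-to-text evidence for "dog";
# only the first one also looks like the stored dog prototype.
pred_a, score_a = confidence([0.9, 0.4], [1.0, 0.1], text_embeddings, visual_prototypes)
pred_b, score_b = confidence([0.9, 0.4], [-0.2, 1.0], text_embeddings, visual_prototypes)
print(pred_a, round(score_a, 3))
print(pred_b, round(score_b, 3))
```

Both images receive the same prediction from the text side, but the second scores lower because its visual embedding does not match the predicted class, which is the complementary signal the takeaway refers to.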

57 of 59

Outlier Synthesis for Robust Learning

Dong, Frusque, Zhao, Chatzi, Fink, “NNG-Mix: Improving Semi-supervised Anomaly Detection with Pseudo-anomaly Generation”, TNNLS 2024

Nejjar, Dong, Fink, “Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework”, TMLR 2025

Liu*, Dong*, Kelly, Fink, Trapp, “Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation”, NeurIPS 2025

Sun, Cao, Dong, Fink, “Unseen Visual Anomaly Generation”, CVPR 2025

58 of 59

Conclusion

Improving Model Reliability

in Multimodal Settings

Multimodal Misclassification Detection

arXiv 2025

Outlier Synthesis for Robust Learning

TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025

Multimodal Generalization and Adaptation

NeurIPS 2023 & ECCV 2024 & ICLR 2025

Multimodal Out-of-Distribution Detection

NeurIPS 2024 spotlight & CVPR 2025 highlight

59 of 59

Thank You

Questions?