Towards Reliable Multimodal Learning for Real-world Systems
Hao Dong
Background
04.02.2026
2016 - 2020
Xi’an Jiaotong University
Bachelor
2020 - 2022
Aalto University
Master
2023 - 2024
EPFL
Visiting Researcher
2022 - 2025
ETH Zurich
Ph.D.
Research Interests
Multimodal Learning
Generalization & Adaptation
Reliable and Trustworthy Machine Learning
Robotics and Autonomous Driving
Machine learning has made significant progress
Action Recognition
Image Classification
Semantic Segmentation
Anomaly Detection
Machine learning can also fail disastrously
A reliable model should
Night
Rain
Fog
Snow
A reliable model should
Can I trust my model’s prediction?
“dog”
wrong prediction!
A reliable model should
Image
LiDAR
Video
Optical Flow
Audio
Thermal
Proposed Framework
Improving Model Reliability
in Multimodal Settings
Multimodal Generalization and Adaptation
NeurIPS 2023 & ECCV 2024 & ICLR 2025
Learning a multimodal model from source domains to either generalize to any unseen domain or adapt to a specific new target domain.
Multimodal Out-of-Distribution Detection
Detecting inputs that are semantically novel by leveraging information from multiple modalities
NeurIPS 2024 spotlight & CVPR 2025 highlight
Multimodal Misclassification Detection
Detecting incorrect predictions by leveraging cross-modal cues to estimate prediction confidence
arXiv 2025
Outlier Synthesis for Robust Learning
TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025
Intentionally generating outliers to expose the model and improve its robustness
Proposed Framework
Improving Model Reliability
in Multimodal Settings
Multimodal Misclassification Detection
Outlier Synthesis for Robust Learning
TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025
This talk
Multimodal Generalization and Adaptation
NeurIPS 2023 & ECCV 2024 & ICLR 2025
Multimodal Out-of-Distribution Detection
NeurIPS 2024 spotlight & CVPR 2025 highlight
arXiv 2025
Traditional Domain Generalization Setup
Unavailable during training
source domains
target domain
Multimodal Domain Generalization
source domain 1
source domain 2
target domain
modality 1: video
modality 2: optical flow
modality 3: audio
Multimodal Domain Generalization - Motivation
Videos are from our newly introduced Human-Animal-Cartoon dataset
Multimodal Domain Generalization - Key Idea
Dong, Nejjar, Sun, Chatzi, Fink, “SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization”, NeurIPS 2023
SimMMDG Framework
Multimodal DG
Missing-modality DG
Key takeaway: Decoupling features into modality-specific and modality-shared components preserves the information unique to each modality, which improves generalization
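A minimal sketch of the decoupling idea, not the SimMMDG implementation. Each modality's embedding is split into a shared half and a specific half; the shared halves of two modalities are pulled together, and each modality's specific half is pushed away from its own shared half. The function names, the half-and-half split, and the cosine-based loss terms are all assumptions made for illustration.

```python
import math

def split_features(embedding):
    """Split an embedding into (shared, specific) halves (assumed layout)."""
    half = len(embedding) // 2
    return embedding[:half], embedding[half:]

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def decoupling_loss(emb_a, emb_b):
    """Toy loss for two modalities (hypothetical, for illustration only)."""
    a_shared, a_specific = split_features(emb_a)
    b_shared, b_specific = split_features(emb_b)
    # Shared parts of the two modalities should point in the same direction.
    align = 1.0 - cosine(a_shared, b_shared)
    # Within each modality, the specific part should not copy the shared part.
    separate = (max(0.0, cosine(a_shared, a_specific))
                + max(0.0, cosine(b_shared, b_specific)))
    return align + separate

# Hand-made 4-d embeddings standing in for video and audio features.
video = [1.0, 0.0, 0.0, 1.0]
audio_redundant = [2.0, 0.0, 1.0, 0.0]   # specific half copies the shared half
audio_decoupled = [2.0, 0.0, 0.0, 3.0]   # specific half is orthogonal

print(decoupling_loss(video, audio_redundant))
print(decoupling_loss(video, audio_decoupled))
```

The redundant audio embedding is penalized because its specific half carries the same direction as its shared half, while the decoupled one is not.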
Following Works
Dong, Chatzi, Fink, “Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision”, ECCV 2024
Address both distribution and label shifts
Dong, Chatzi, Fink, “Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization”, ICLR 2025
Adapt the model online
Following Works: A Comprehensive Survey
Dong et al., “Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models”, TPAMI 2026
Out-of-Distribution Detection
Out-of-Distribution Detection - Ideally
Prediction 1: “dog” 0.8, “cat” 0.15, “fish” 0.05
Prediction 2: “dog” 0.4, “cat” 0.3, “fish” 0.3
Out-of-Distribution Detection - Overconfidence
Prediction 1: “dog” 0.7, “cat” 0.25, “fish” 0.05
Prediction 2: “dog” 0.1, “cat” 0.1, “fish” 0.8
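The contrast between the "Ideally" and "Overconfidence" slides can be sketched with a confidence-threshold detector. The score used here is the maximum softmax probability, a common baseline; the threshold of 0.5 and the function names are assumptions. The sketch also assumes that the second set of probabilities on each slide belongs to the out-of-distribution input, which the extracted slide text does not state explicitly.

```python
def msp_score(probs):
    """Maximum softmax probability: the confidence of the top class."""
    return max(probs.values())

def flag_as_ood(probs, threshold=0.5):
    """Flag an input as OOD when its top-class confidence is below threshold."""
    return msp_score(probs) < threshold

# Probabilities copied from the two slides.
ideal_first = {"dog": 0.8, "cat": 0.15, "fish": 0.05}
ideal_second = {"dog": 0.4, "cat": 0.3, "fish": 0.3}
over_first = {"dog": 0.7, "cat": 0.25, "fish": 0.05}
over_second = {"dog": 0.1, "cat": 0.1, "fish": 0.8}

for name, probs in [("ideal, first", ideal_first),
                    ("ideal, second", ideal_second),
                    ("overconfident, first", over_first),
                    ("overconfident, second", over_second)]:
    print(name, msp_score(probs), flag_as_ood(probs))
```

In the ideal case the second input falls below the threshold and is flagged. In the overconfident case its top-class probability is 0.8, so a purely confidence-based detector accepts it.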
Out-of-Distribution Detection - Why Multimodal?
The first benchmark for Multimodal OOD Detection
Dong, Zhao, Chatzi, Fink, “MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities”, NeurIPS 2024 spotlight
MultiOOD Benchmark
Modality Prediction Discrepancy
The predictions from video and optical flow agree closely on ID data but diverge on OOD data.
Prediction discrepancies are highly correlated with the final OOD detection performance.
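One way to quantify the discrepancy between two modalities is a distance between their softmax outputs. The sketch below uses the L1 distance; that choice, the function names, and the toy logits are assumptions for illustration and are not taken from the MultiOOD paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_discrepancy(logits_a, logits_b):
    """L1 distance between the two modalities' class probabilities (0 to 2)."""
    probs_a, probs_b = softmax(logits_a), softmax(logits_b)
    return sum(abs(x - y) for x, y in zip(probs_a, probs_b))

# Toy logits standing in for a video branch and an optical-flow branch.
agreeing = prediction_discrepancy([5.0, 0.0, 0.0], [5.0, 0.0, 0.0])
disagreeing = prediction_discrepancy([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])
print(agreeing, disagreeing)
```

Under this measure, a sample on which the two branches pick different classes with high confidence scores close to the maximum of 2, while a sample with identical predictions scores 0.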
Agree-to-Disagree (A2D) Algorithm
Different modalities should Agree on the prediction for the ground-truth class, but Disagree on the remaining classes by maximizing their prediction distance.
A2D training amplifies Modality Prediction Discrepancy, enhancing the efficacy of OOD detection.
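A rough sketch of an agree-to-disagree style objective for two modalities, under stated assumptions: cross-entropy on the ground-truth class stands in for the "agree" term, and an L1 distance over the remaining classes stands in for the "disagree" term. The function name, the `weight` parameter, the distance, and the lack of renormalization are illustrative choices, not the published A2D formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def a2d_style_loss(logits_a, logits_b, label, weight=1.0):
    """Toy objective: agree on the true class, disagree on the others."""
    probs_a, probs_b = softmax(logits_a), softmax(logits_b)
    # Agree: both modalities are trained to predict the ground-truth class.
    agree = -math.log(probs_a[label]) - math.log(probs_b[label])
    # Disagree: distance between predictions on all non-ground-truth classes.
    disagree = sum(abs(x - y)
                   for i, (x, y) in enumerate(zip(probs_a, probs_b))
                   if i != label)
    # Minimizing the loss maximizes the disagreement term.
    return agree - weight * disagree

same_tail = a2d_style_loss([3.0, 1.0, 0.0], [3.0, 1.0, 0.0], label=0)
different_tail = a2d_style_loss([3.0, 1.0, 0.0], [3.0, 0.0, 1.0], label=0)
print(same_tail, different_tail)
```

Both pairs of logits assign the same probability to the true class, so the loss is lower only for the pair whose predictions on the other classes differ.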
Nearest Neighbor Prototype-based Mixup (NP-Mix)
Synthesizing outliers that span wider embedding spaces by leveraging information from nearest-neighbor classes
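A simplified sketch of prototype-based mixing, assuming class prototypes are mean feature vectors and that an outlier is formed by interpolating between a class prototype and the prototype of its nearest neighboring class. The helper names, the toy features, and the fixed mixing coefficient `lam` are assumptions; the actual NP-Mix procedure may differ in how neighbors and coefficients are chosen.

```python
import math

def prototype(features):
    """Mean feature vector of one class."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def nearest_class(label, prototypes):
    """Class whose prototype is closest (Euclidean) to the given class."""
    others = [c for c in prototypes if c != label]
    return min(others, key=lambda c: math.dist(prototypes[label], prototypes[c]))

def mix_with_nearest(label, prototypes, lam):
    """Interpolate between a class prototype and its nearest neighbor's."""
    neighbor = nearest_class(label, prototypes)
    p, q = prototypes[label], prototypes[neighbor]
    return [lam * a + (1.0 - lam) * b for a, b in zip(p, q)]

# Hand-made 2-d features for three hypothetical action classes.
features_by_class = {
    "run": [[0.0, 0.0], [0.0, 2.0]],
    "walk": [[2.0, 0.0], [2.0, 2.0]],
    "swim": [[10.0, 10.0], [12.0, 10.0]],
}
prototypes = {c: prototype(f) for c, f in features_by_class.items()}
outlier = mix_with_nearest("run", prototypes, lam=0.5)
print(nearest_class("run", prototypes), outlier)
```

The synthesized point lies between two neighboring classes in embedding space, a region that in-distribution training samples rarely cover.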
Visualization of Synthesized Outliers
[Figure: synthesized outliers from (a) VOS, (b) NPOS, (c) Mixup, (d) NP-Mix (ours)]
Multimodal OOD Detection on HMDB51
Score Distributions
[Figure: score distributions for (a) Energy, (b) LogitNorm, (c) Energy++, (d) LogitNorm++]
Key takeaway: Maximizing prediction discrepancies across modalities significantly improves OOD performance
Following Work: Dynamic Prototype Updating
Li, Gong, Dong, Yang, Tu, Zhao, “DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection”, CVPR 2025 highlight
Amplify modality prediction discrepancies proportionally to the sample’s distance from its class prototype.
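A toy illustration of distance-proportional weighting, assuming the weight is simply the Euclidean distance between a sample's feature and its class prototype, and that prototypes are refreshed with an exponential moving average. Both choices, along with every function and parameter name below, are assumptions for illustration rather than the DPU method itself.

```python
import math

def update_prototype(proto, feature, momentum=0.9):
    """Exponential moving average update of a class prototype (assumed rule)."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(proto, feature)]

def discrepancy_weight(feature, proto, scale=1.0):
    """Weight grows with the sample's distance from its class prototype."""
    return scale * math.dist(feature, proto)

def weighted_discrepancy(probs_a, probs_b, feature, proto):
    """Cross-modal L1 discrepancy scaled by the distance-based weight."""
    discrepancy = sum(abs(x - y) for x, y in zip(probs_a, probs_b))
    return discrepancy_weight(feature, proto) * discrepancy

proto = [0.0, 0.0]
near_sample, far_sample = [0.3, 0.4], [3.0, 4.0]
probs_video, probs_flow = [0.7, 0.3], [0.5, 0.5]

print(weighted_discrepancy(probs_video, probs_flow, near_sample, proto))
print(weighted_discrepancy(probs_video, probs_flow, far_sample, proto))
print(update_prototype(proto, far_sample))
```

With the same pair of predictions, the sample farther from the prototype contributes a proportionally larger discrepancy term.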
Multimodal Misclassification Detection
[Figure: an image encoder and a text encoder; the prompt template “A photo of a [CLASS]” with classes bird, dog, cat, plane; the output “A photo of a dog” is marked “Wrong Prediction!”]
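The prompt-based classification shown on this slide can be sketched as follows. Real image and text encoders are replaced by hand-made 3-d vectors, so the embeddings, the temperature value, and the function names are all hypothetical; only the prompt template and class names come from the slide.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_predict(image_emb, text_embs, temperature=0.01):
    """Pick the prompt whose embedding is most similar to the image."""
    prompts = list(text_embs)
    sims = [cosine(image_emb, text_embs[p]) / temperature for p in prompts]
    probs = softmax(sims)
    best = max(range(len(prompts)), key=probs.__getitem__)
    return prompts[best], dict(zip(prompts, probs))

classes = ["bird", "dog", "cat", "plane"]
# Stand-in text embeddings, one per prompt (not produced by a real encoder).
toy_vectors = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
text_embs = {f"A photo of a {c}": v for c, v in zip(classes, toy_vectors)}

image_emb = [0.9, 0.1, 0.0]  # stand-in image embedding
prediction, probs = zero_shot_predict(image_emb, text_embs)
print(prediction)
```

The model always returns the most similar prompt, whether or not that prompt is correct, which is why a separate signal is needed to judge whether the prediction can be trusted.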
Multimodal Misclassification Detection - Ideally
Prediction 1: “dog” 0.8, “cat” 0.15, “bird” 0.05
Prediction 2: “dog” 0.3, “cat” 0.45, “bird” 0.25
Multimodal Misclassification Detection - Overconfidence
Prediction 1: “dog” 0.75, “cat” 0.15, “bird” 0.1
Prediction 2: “dog” 0.15, “cat” 0.8, “bird” 0.05
Motivation
TrustVLM framework
Dong, Liu, Liang, Chatzi, Fink, “To Trust Or Not To Trust Your Vision-Language Model’s Prediction”, arXiv 2025
Illustration of TrustVLM’s Mechanism
Key takeaway: Image-to-image similarity provides an essential complement to image-to-text similarity
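A minimal sketch of how an image-to-image signal could complement image-to-text confidence, assuming each class has a stored visual prototype and that the two signals are simply added. The additive combination, the prototypes, and all names are assumptions for illustration and should not be read as the TrustVLM scoring function.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def trust_score(image_text_probs, image_emb, visual_prototypes):
    """Combine text-based confidence with similarity to a visual prototype."""
    predicted = max(image_text_probs, key=image_text_probs.get)
    text_confidence = image_text_probs[predicted]
    # Image-to-image check: does the input look like the predicted class?
    image_similarity = cosine(image_emb, visual_prototypes[predicted])
    return predicted, text_confidence + image_similarity

# Same overconfident image-to-text probabilities for both inputs.
probs = {"dog": 0.75, "cat": 0.15, "bird": 0.1}
# Stand-in visual prototypes (not produced by a real encoder).
visual_prototypes = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "bird": [0.7, 0.7]}

looks_like_dog = [0.9, 0.1]
looks_like_cat = [0.1, 0.9]
_, score_correct = trust_score(probs, looks_like_dog, visual_prototypes)
_, score_wrong = trust_score(probs, looks_like_cat, visual_prototypes)
print(score_correct, score_wrong)
```

Both inputs receive the same text-based confidence for "dog", but the one whose embedding is far from the dog prototype gets a lower combined score, which a threshold could use to flag a likely misclassification.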
Outlier Synthesis for Robust Learning
Dong, Frusque, Zhao, Chatzi, Fink, “NNG-Mix: Improving Semi-supervised Anomaly Detection with Pseudo-anomaly Generation”, TNNLS 2024
Nejjar, Dong, Fink, “Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework”, TMLR 2025
Liu*, Dong*, Kelly, Fink, Trapp, “Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation”, NeurIPS 2025
Sun, Cao, Dong, Fink, “Unseen Visual Anomaly Generation”, CVPR 2025
Conclusion
Improving Model Reliability
in Multimodal Settings
Multimodal Misclassification Detection
arXiv 2025
Outlier Synthesis for Robust Learning
TNNLS 2024 & TMLR 2025 & CVPR 2025 & NeurIPS 2025
Multimodal Generalization and Adaptation
NeurIPS 2023 & ECCV 2024 & ICLR 2025
Multimodal Out-of-Distribution Detection
NeurIPS 2024 spotlight & CVPR 2025 highlight
Thank You
Questions?