Multimodal Learning Part 4: Recent Directions and Open Questions
Paul Pu Liang
Machine Learning Department
Carnegie Mellon University
What is Multimodal?
Heterogeneous
Connected
Interacting
Why is it hard?
Multimodal machine learning is the scientific study of heterogeneous and interconnected data ☺
Core Multimodal Challenges
Representation
Alignment
Transference
Generation
Quantification
Reasoning
What is Multimodal?
Heterogeneous
Connected
Interacting
Why is it hard?
What is next?
Representation
Alignment
Reasoning
Generation
Transference
Quantification
Future Direction: High-Modality
[Figure: scaling from few modalities to high-modality learning across language, vision, LIDAR, audio, sensors, graphs, control, financial data, sets, tables, and medical data (MultiBench)]
Challenges: high-modality learning, non-parallel learning, limited resources.
High-Modality Learning
How can we transfer knowledge across multiple tasks, each over a different subset of modalities?
[Examples: tasks such as video classification, sentiment and emotion recognition, and robot dynamics prediction, each defined over a different subset of language, video, audio, and time-series modalities]
Generalization across modalities and tasks is especially important when some tasks are low-resource.
[Liang et al., MultiBench: Multiscale Benchmarks for Multimodal Representation Learning. NeurIPS 2021]
HighMMT
Transfer across partially observable modalities.
HighMMT: unified model + parameter sharing + multitask and transfer learning.
[Architecture: each modality (language, video, audio, time-series) is mapped by a modality-specific embedding into a standardized input sequence; a single shared multimodal HighMMT model (same architecture, same parameters!) feeds task-specific classifiers for video classification, sentiment/emotion recognition, and robot dynamics. Non-parallel multitask learning.]
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
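Below is a minimal sketch in PyTorch of this unified-architecture idea (illustrative, not the released HighMMT code): modality-specific embeddings standardize heterogeneous inputs into a common token space, one shared encoder handles any subset of modalities, and only small task-specific heads differ.

import torch
import torch.nn as nn

class UnifiedMultimodalModel(nn.Module):
    def __init__(self, modality_dims, task_classes, d_model=256):
        super().__init__()
        # Modality-specific embeddings: standardize each modality into d_model tokens
        self.embed = nn.ModuleDict(
            {m: nn.Linear(dim, d_model) for m, dim in modality_dims.items()})
        # One shared encoder, reused for every task and every modality subset
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=4)
        # Task-specific classifiers: the only per-task parameters
        self.heads = nn.ModuleDict(
            {t: nn.Linear(d_model, n) for t, n in task_classes.items()})

    def forward(self, inputs, task):
        # inputs: dict of modality name -> (batch, seq_len, feat_dim); any subset works
        tokens = torch.cat([self.embed[m](x) for m, x in inputs.items()], dim=1)
        pooled = self.shared(tokens).mean(dim=1)   # shared multimodal model + pooling
        return self.heads[task](pooled)            # task-specific prediction

model = UnifiedMultimodalModel(
    modality_dims={"language": 300, "video": 512, "audio": 128},
    task_classes={"sentiment": 2, "video_cls": 10})
logits = model({"language": torch.randn(8, 20, 300),
                "audio": torch.randn(8, 50, 128)}, task="sentiment")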
HighMMT
Traditional approaches: different model + different parameters for every task.
[Plot: performance vs. parameters for tasks spanning robotic manipulation, disease codes, sarcasm, humor, image-text retrieval, design interfaces, and emotions, over modalities including language, image, speech, video, audio, sensors, proprioception, time-series, sets, and tables. All model combinations (>10,000) are shown with their Pareto front; HighMMT single-task models and a single HighMMT multitask model are overlaid.]
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
Quantifying Modality Heterogeneity
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
Information transfer, transfer learning perspective:
1a. Estimate modality heterogeneity via transfer.
This implicitly captures: (1) element representation, (2) element distribution, (3) structure, (4) information, (5) noise.
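A hedged sketch of how such a transfer-based estimate could be computed (the exact protocol and distance in the paper may differ; train_and_eval is a hypothetical callable the reader supplies):

import itertools
import numpy as np

def heterogeneity_matrix(tasks, train_and_eval):
    """train_and_eval(source, target) -> target performance after pre-training
    on source (source=None means training the target from scratch)."""
    n = len(tasks)
    base = {t: train_and_eval(None, t) for t in tasks}  # target-only baselines
    H = np.zeros((n, n))
    for (i, s), (j, t) in itertools.product(enumerate(tasks), repeat=2):
        if i != j:
            # The larger the drop from transferring, the more heterogeneous the pair
            H[i, j] = max(0.0, base[t] - train_and_eval(s, t))
    return (H + H.T) / 2  # symmetrize into a distance-like matrix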
Quantifying Modality Heterogeneity
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
Information transfer, transfer learning perspective:
1a. Estimate modality heterogeneity via transfer.
1b. Estimate interaction heterogeneity via transfer.
2a. Compute the modality heterogeneity matrix over modalities, e.g. (lower triangle):

0
1  0
3  2  0
1  2  3  0
5  4  6  3  0

2b. Compute the interaction heterogeneity matrix over modality pairs, e.g. (lower triangle):

0
1  0
3  2  0
1  2  4  0
Quantifying Modality Heterogeneity
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
Information transfer, transfer learning perspective:
3. Determine parameter clustering: using the heterogeneity matrices from steps 2a and 2b, group modalities (and modality pairs) with low mutual heterogeneity so that each group shares parameters.
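As a concrete illustration of step 3 (a sketch; the paper's clustering procedure may differ in detail), hierarchical clustering over the example modality heterogeneity matrix above:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Symmetric heterogeneity matrix built from the example lower triangle (0 diagonal)
H = np.array([[0, 1, 3, 1, 5],
              [1, 0, 2, 2, 4],
              [3, 2, 0, 3, 6],
              [1, 2, 3, 0, 3],
              [5, 4, 6, 3, 0]], dtype=float)

Z = linkage(squareform(H), method="average")        # agglomerative clustering
groups = fcluster(Z, t=2.5, criterion="distance")   # cut the tree at a threshold
print(groups)  # modalities with the same label share one set of parameters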
Quantifying Modality Heterogeneity
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
Information transfer, transfer learning perspective
1. Homogeneous Pre-training
2. Heterogeneity-aware Fine-tuning
HighMMT
HighMMT heterogeneity-aware sharing: estimate heterogeneity to determine parameter sharing
[Plot: on the same performance-vs-parameters axes as before, alongside all >10,000 model combinations, the Pareto front, HighMMT single-task, and HighMMT multitask, the HighMMT heterogeneity-aware model is added.]
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
HighMMT
Transfer across partially observable modalities
HighMMT: unified model + parameter sharing + multitask and transfer learning
[Figure: a HighMMT model trained on video classification (video, audio), sentiment and emotion recognition (language, video, audio), and robot dynamics (video, time-series) transfers to a new disease-code prediction task over time-series and tabular data.]
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
HighMMT
Transfer across partially observable modalities
HighMMT: unified model + parameter sharing + multitask and transfer learning
Target task: MIMIC            Target task: UR-FUNNY
# source tasks  performance   # source tasks  performance
0               67.7%         0               63.3%
1               68.3%         1               64.1%
2               68.5%         2               65.5%
3               68.5%         3               65.7%
(source tasks drawn from different modalities, research areas, and tasks)

Achieves both multitask and transfer capabilities across modalities and tasks.
[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
Future Direction: Heterogeneity & Interactions
Homogeneity vs. heterogeneity.
Challenges: arbitrary tokenization; beyond differentiable interactions; causal, logical, and brain-inspired interactions; theoretical foundations of interactions.
Quantifying Interactions
An information decomposition of multimodal interactions into four quantities: unique 1, unique 2, redundancy, and synergy. Three uses:
1. Dataset quantification
2. Model quantification
3. Model selection
[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]
Quantifying Interactions
Classical information theory offers the interaction information I(X1; X2; Y) = I(X1, X2; Y) - I(X1; Y) - I(X2; Y), but it conflates interaction types and can be negative!
Partial information decomposition instead splits the total information into nonnegative parts: I(X1, X2; Y) = R + U1 + U2 + S, with I(X1; Y) = R + U1 and I(X2; Y) = R + U2.
[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]
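A quick numerical check of why the classical quantity is insufficient (a self-contained sketch; the joint distributions are toy examples): a purely redundant source drives the interaction information negative, while a purely synergistic XOR source drives it positive, so a single signed number cannot separate the two.

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def interaction_information(p):  # p has shape (|X1|, |X2|, |Y|)
    i_x1_y  = entropy(p.sum((1, 2))) + entropy(p.sum((0, 1))) - entropy(p.sum(1))
    i_x2_y  = entropy(p.sum((0, 2))) + entropy(p.sum((0, 1))) - entropy(p.sum(0))
    i_x12_y = entropy(p.sum(2)) + entropy(p.sum((0, 1))) - entropy(p)
    return i_x12_y - i_x1_y - i_x2_y

p_red = np.zeros((2, 2, 2))              # redundancy: X1 = X2 = Y (one fair bit)
p_red[0, 0, 0] = p_red[1, 1, 1] = 0.5
p_xor = np.zeros((2, 2, 2))              # synergy: Y = X1 XOR X2, X1 and X2 fair bits
for a in (0, 1):
    for b in (0, 1):
        p_xor[a, b, a ^ b] = 0.25
print(interaction_information(p_red))    # -1.0 bit
print(interaction_information(p_xor))    # +1.0 bit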
Quantifying Interactions
[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]
Marginal-matching distributions
Only need p(x1,y) and p(x2,y) to infer R, U1, and U2
Need p(x1,x2) to infer S
Quantifying Interactions
[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]
Marginal-matching distributions:
If X1, X2, Y have small and discrete support, an exact solution is available via convex programming with linear constraints.
For high-dimensional and continuous X1, X2, Y, use an approximate neural-network estimator.
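For the discrete case, here is a hedged sketch of such a convex program (in the marginal-matching, unique-information style of PID that this framework builds on; illustrative, not the paper's released code):

import cvxpy as cp
import numpy as np

def pid_discrete(p):
    """Exact PID of a small discrete joint p with shape (n1, n2, ny)."""
    n1, n2, ny = p.shape
    q = cp.Variable((n1 * n2, ny), nonneg=True)   # candidate joint q(x1, x2, y)
    p1y, p2y = p.sum(axis=1), p.sum(axis=0)       # fixed marginals p(x1,y), p(x2,y)
    cons = [cp.sum(q[i * n2:(i + 1) * n2], axis=0) == p1y[i] for i in range(n1)]
    cons += [cp.sum(q[j::n2], axis=0) == p2y[j] for j in range(n2)]
    # Minimize I_q(X1,X2;Y): q(y) is fixed by the marginal constraints, so this
    # reduces to sum q log(q / q(x1,x2)), jointly convex via relative entropy.
    qx = cp.sum(q, axis=1, keepdims=True) @ np.ones((1, ny))
    cp.Problem(cp.Minimize(cp.sum(cp.rel_entr(q, qx))), cons).solve(solver=cp.SCS)

    def mi(pxy):  # mutual information (in nats) of a 2-D joint distribution
        px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
        nz = pxy > 1e-12
        return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

    q_star = np.clip(q.value, 0, None).reshape(n1, n2, ny)
    iq = mi(q_star.reshape(n1 * n2, ny))          # I_q(X1,X2;Y) at the optimum
    i1, i2, ip = mi(p.sum(1)), mi(p.sum(0)), mi(p.reshape(n1 * n2, ny))
    u1, u2 = iq - i2, iq - i1                     # unique informations
    return {"R": i1 - u1, "U1": u1, "U2": u2, "S": ip - iq}

p_xor = np.zeros((2, 2, 2))                       # Y = X1 XOR X2: pure synergy
for a in (0, 1):
    for b in (0, 1):
        p_xor[a, b, a ^ b] = 0.25
print(pid_discrete(p_xor))                        # R, U1, U2 ≈ 0; S ≈ log 2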
Model Selection
[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]
All three steps operate inside the interaction polytope spanned by (R, U1, U2, S):
1. Dataset quantification: can be done with synthetic data.
2. Model quantification: model families trained on the synthetic data.
3. Model selection: selects models that reach >96% of the best performance.
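A minimal sketch of step 3's nearest-profile idea (the interaction profiles below are made-up numbers purely for illustration; the paper's selection criterion may differ):

import numpy as np

dataset_profile = np.array([0.4, 0.1, 0.1, 0.4])    # (R, U1, U2, S) of a new dataset
model_profiles = {                                   # hypothetical model families
    "additive_fusion":   np.array([0.5, 0.2, 0.2, 0.1]),
    "tensor_fusion":     np.array([0.3, 0.1, 0.1, 0.5]),
    "unimodal_ensemble": np.array([0.2, 0.4, 0.4, 0.0]),
}
# Recommend the family whose interaction profile is closest to the dataset's
best = min(model_profiles,
           key=lambda m: np.linalg.norm(model_profiles[m] - dataset_profile))
print(best)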
Future Direction: Long-term
Short-term (seconds or minutes) vs. long-term.
Challenges: compositionality, memory, personalization.
Future Direction: Interaction
Multimodal interaction combining perception, reasoning, generation, and social intelligence (Social-IQ).
Challenges: multi-party interaction, generation, ethics.
Future Direction: Real-world
Applications: healthcare decision support, intelligent interfaces and vehicles, online learning and education.
Challenges: fairness, robustness, generalization, interpretation (MultiViz).
MultiViz: Visualizing & Interpreting Multimodal Models
How can we understand the modeling of heterogeneity and interconnections, and gain insights for safe real-world deployment? We need to examine models' internal mechanics.
[Running VQA example: a multimodal model answers "Yes!" to "Is there a red shape above a circle?"; the same answer can arise from very different internal mechanics.]
[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
MultiViz: Visualizing & Interpreting Multimodal Models
MultiViz analyzes the running example in four stages:
1. Unimodal importance: does the model correctly identify keywords in the question?
2. Cross-modal interactions: does the model correctly relate the question with the image?
3. Multimodal representations: does the model consistently assign concepts (e.g., "red", "circle") to internal features?
4. Multimodal prediction: does the model correctly compose question and image information (e.g., evidence for "red" and "circle") into the final "Yes!"?
[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
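For stage 1, one common realization is gradient-based attribution; a minimal sketch (illustrative only, since the stage is agnostic to the specific attribution method):

import torch

def unimodal_importance(f, x):
    # Input-times-gradient attribution over one modality's input features
    x = x.clone().requires_grad_(True)
    f(x).backward()                 # f returns a scalar, e.g. the predicted logit
    return (x.grad * x).abs()       # per-feature importance scores

# Toy usage: importance of 5 input features under a linear scorer
w = torch.randn(5)
print(unimodal_importance(lambda z: z @ w, torch.randn(5)))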
MultiViz: Visualizing & Interpreting Multimodal Models
How can we interpret cross-modal interactions in multimodal models?
2. Cross-modal interactions: a natural second-order extension of gradient-based approaches!
Builds on statistical non-additive interactions [Friedman & Popescu, 2008; Sorokina et al., 2008]. Also related: EMAP [Hessel et al., 2020], DIME [Lyu et al., 2022].
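A minimal sketch of the second-order idea (an assumed form, not necessarily MultiViz's exact implementation): the interaction between feature i of one modality and feature j of another is the mixed second derivative of the model output, computed with two autograd passes.

import torch

def cross_modal_interaction(f, x1, x2):
    x1 = x1.clone().requires_grad_(True)
    x2 = x2.clone().requires_grad_(True)
    out = f(x1, x2)                 # scalar model output, e.g. a logit
    g1 = torch.autograd.grad(out, x1, create_graph=True)[0]   # d out / d x1
    rows = [torch.autograd.grad(g1[i], x2, retain_graph=True)[0]
            for i in range(x1.numel())]                       # d^2 out / d x1_i d x2
    return torch.stack(rows)        # (dim x1) x (dim x2) interaction matrix

# Toy check: for the bilinear model f = x1^T W x2, the matrix recovers W exactly
W = torch.randn(3, 4)
print(cross_modal_interaction(lambda a, b: a @ W @ b,
                              torch.randn(3), torch.randn(4)))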
MultiViz: Visualizing & Interpreting Multimodal Models
How can we interpret cross-modal interactions in multimodal models?
[Examples of interpreted cross-modal correspondences and relationships:
- CLEVR: "The other small shiny thing that is the same shape as the tiny yellow shiny object is what color?"
- VQA 2.0: "How many birds?"
- Flickr-30k (correspondences): "Three small dogs, two white and one black and white, on a sidewalk."
- CMU-MOSEI (relationships): "Why am I spending my money watching this? (sigh) I think I was more sad…"]
[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
MultiViz: Visualizing & Interpreting Multimodal Models
How can we understand multimodal representations?
3. Multimodal representations: local and global analysis of individual features.
[Example: a feature active on "color" questions. Local analysis: "What color is the tie of the second man to the left?" → "Blue". Global analysis: "What color is the Salisbury Rd sign?", "What color is the building?", "What color are the checkers on the wall?"]
Evaluating Interpretability
How can we evaluate the success of interpreting internal mechanics?
Problem: real-world datasets and models do not come with annotations for unimodal importance, cross-modal interactions, or multimodal representations!
[Figure: the four MultiViz stages: unimodal importance, cross-modal interactions, multimodal representations, multimodal prediction]
[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
Evaluating Interpretability
How can we evaluate the success of interpreting internal mechanics?
1. Model simulation: can humans reproduce model predictions with high accuracy and agreement?
[Figure: given visualizations of unimodal importance, cross-modal interactions, and multimodal representations, annotators answer "Yes" to simulate the model's prediction; multimodal prediction and open challenges are shown alongside]
[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
Evaluating Interpretability
How can we evaluate the success of interpreting internal mechanics?

Stages shown to annotators        Accuracy
U                                 55.0%
U + C                             65.0%
U + C + local R                   61.7%
U + C + local R + global R        71.7%
U + C + local R + global R + P    81.7%
(U = unimodal importance, C = cross-modal interactions, R = representations, P = prediction)

Adding MultiViz stages leads to higher accuracy and agreement.
[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
Evaluating Interpretability
How can we evaluate the success of interpreting internal mechanics?
2. Model debugging: can humans find bugs in the model and suggest fixes for improvement?
[Figure: find bugs → fix bugs, across unimodal importance, cross-modal interactions, multimodal representations, multimodal prediction, and open challenges]
[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
Evaluating Interpretability
Revisiting the representation analysis (local and global analysis of the "color" feature) surfaces a bug: "Models pick up cross-modal interactions but fail in identifying color!"
Evaluating Interpretability
How can we evaluate the success of interpreting internal mechanics?
Fix: add targeted examples involving color.

Strategy for added examples   Improvement
Random                        +1.4%
Uncertainty                   +0.2%
MultiViz                      +30.5%

MultiViz enables error analysis and debugging of multimodal models.
Side note: we used this to discover a bug in a popular deep learning code repository.
[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
What is Multimodal?
Heterogeneous
Connected
Interacting
Why is it hard?
What is next?
Representation
Alignment
Reasoning
Generation
Transference
Quantification
High-modality
Heterogeneity
Long-term
Interaction
Real-world
Liang, Zadeh, and Morency. Foundations and Trends in Multimodal Machine Learning. 2022