1 of 45

Multimodal Learning Part 4: Recent Directions and Open Questions

Paul Pu Liang

Machine Learning Department

Carnegie Mellon University

2 of 45


What is Multimodal?

Heterogeneous

Connected

Interacting

Why is it hard?

Multimodal machine learning is the scientific study of heterogeneous and interconnected data ☺

3 of 45


Core Multimodal Challenges

Representation

Alignment

Transference

Generation

Quantification

Reasoning


4 of 45


What is Multimodal?

Heterogeneous

Connected

Interacting

Why is it hard?

What is next?


Representation

Alignment

Reasoning

Generation

Transference

Quantification


5 of 45

Future Direction: High-modality

From learning with a few modalities to high-modality learning across language, vision, LIDAR, audio, sensors, graphs, control, financial, set, table, and medical data (as benchmarked in MultiBench).

Challenges: non-parallel learning, limited resources.

6 of 45

High-Modality Learning


How can we transfer knowledge across multiple tasks, each over a different subset of modalities?

E.g., video classification (video, audio); sentiment and emotion recognition (language, video, audio); robot dynamics prediction (video, time-series).

Generalization across modalities and tasks

Important if some tasks are low-resource

[Liang et al., MultiBench: Multiscale Benchmarks for Multimodal Representation Learning. NeurIPS 2021]

7 of 45

HighMMT


Transfer across partially observable modalities.

HighMMT: unified model + parameter sharing + multitask and transfer learning.

Pipeline: modality-specific embeddings → standardized input sequence → shared HighMMT model → task-specific classifiers, trained with non-parallel multitask learning. The same model architecture and the same parameters serve every task: sentiment and emotion recognition (language, video, audio), video classification (audio, video), and robot dynamics prediction (video, time-series).

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]
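As a concrete picture of this pipeline, here is a minimal PyTorch sketch of the idea (our own illustration, not the released HighMMT code; all module names, dimensions, and the two-layer transformer trunk are assumptions):

```python
import torch
import torch.nn as nn

class HighMMTSketch(nn.Module):
    """Sketch: per-modality embeddings -> one shared encoder -> per-task heads."""
    def __init__(self, modality_dims, task_classes, d_model=256):
        super().__init__()
        # Modality-specific embeddings produce a standardized token sequence.
        self.embed = nn.ModuleDict(
            {m: nn.Linear(d, d_model) for m, d in modality_dims.items()})
        # One shared multimodal model reused by every task (same parameters!).
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        # Small task-specific classifiers on top.
        self.heads = nn.ModuleDict(
            {t: nn.Linear(d_model, c) for t, c in task_classes.items()})

    def forward(self, inputs, task):
        # inputs: {modality: (batch, seq, dim)} for whatever subset is observed.
        tokens = torch.cat([self.embed[m](x) for m, x in inputs.items()], dim=1)
        pooled = self.shared(tokens).mean(dim=1)
        return self.heads[task](pooled)

model = HighMMTSketch(
    modality_dims={"language": 300, "video": 512, "audio": 128},
    task_classes={"sentiment": 2, "video_cls": 10})
logits = model({"language": torch.randn(4, 20, 300),
                "audio": torch.randn(4, 50, 128)}, task="sentiment")
```

Non-parallel multitask training then alternates batches from the different tasks through the same shared trunk.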

8 of 45

HighMMT


Traditional approaches: a different model + different parameters for each task.

[Figure: performance vs. model size for all model combinations (>10,000) across tasks (robotic manipulation, disease codes, sarcasm, humor, image-text retrieval, design interface, emotions) and modalities (language, image, speech, video, audio, sensors, proprioception, time-series, set, table), with the Pareto front highlighted.]

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

9 of 45

HighMMT


Traditional approaches: a different model + different parameters for each task.

[Figure: the same plot, adding four single-task HighMMT models, which reach the Pareto front of the >10,000 baseline combinations.]

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

10 of 45

HighMMT


Traditional approaches: a different model + different parameters for each task.

[Figure: the same plot, adding a single HighMMT multitask model shared across tasks, which improves over the single-task variants on the Pareto front.]

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

11 of 45

Quantifying Modality Heterogeneity


[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

Information transfer: a transfer-learning perspective.

1a. Estimate modality heterogeneity via transfer. This implicitly captures five dimensions of heterogeneity: (1) element representation, (2) element distribution, (3) structure, (4) information, and (5) noise.
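As a rough sketch of step 1a (our simplification, not the paper's exact estimator; `train`, `finetune`, and `evaluate` are hypothetical stand-ins for a full training pipeline), heterogeneity can be scored as a transfer distance:

```python
def transfer_distance(mod_a, mod_b, train, finetune, evaluate):
    """How much worse pretraining on A then adapting to B is vs. B from scratch."""
    scratch = evaluate(train(mod_b), mod_b)
    transferred = evaluate(finetune(train(mod_a), mod_b), mod_b)
    # Small gap: transfer works, so A and B behave homogeneously;
    # large gap: the pair is heterogeneous and should not share parameters.
    return scratch - transferred

def heterogeneity_matrix(modalities, **pipeline):
    """Pairwise transfer distances, as in the matrices on the next slides."""
    return [[transfer_distance(a, b, **pipeline) for b in modalities]
            for a in modalities]
```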

12 of 45

Quantifying Modality Heterogeneity


[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

Information transfer: a transfer-learning perspective.

1a. Estimate modality heterogeneity via transfer.

1b. Estimate interaction heterogeneity via transfer.

2a. Compute the modality heterogeneity matrix, e.g. transfer distances over five modalities (lower triangle):

0
1 0
3 2 0
1 2 3 0
5 4 6 3 0

2b. Compute the interaction heterogeneity matrix over modality pairs, e.g. (lower triangle):

0
1 0
3 2 0
1 2 4 0

13 of 45

Quantifying Modality Heterogeneity


[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

Information transfer: a transfer-learning perspective.

3. Determine parameter clustering: group the modalities (from the modality heterogeneity matrix, 2a) and the modality pairs (from the interaction heterogeneity matrix, 2b) whose heterogeneity is low, and share parameters within each group.
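Step 3 can then be a standard clustering of these distances. A minimal sketch (our illustration; the modality labels and the average-linkage choice are assumptions, reusing the example matrix from 2a):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

modalities = ["language", "image", "audio", "time-series", "table"]  # assumed labels
D = np.array([[0, 1, 3, 1, 5],      # symmetric version of the matrix in 2a
              [1, 0, 2, 2, 4],
              [3, 2, 0, 3, 6],
              [1, 2, 3, 0, 3],
              [5, 4, 6, 3, 0]], dtype=float)

Z = linkage(squareform(D), method="average")     # hierarchical clustering
groups = fcluster(Z, t=2, criterion="maxclust")  # e.g., two parameter groups
for m, g in zip(modalities, groups):
    print(f"{m}: parameter-sharing group {g}")
```

Modalities that land in the same group share encoder parameters; the interaction matrix is clustered the same way to decide sharing among fusion parameters.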

14 of 45

Quantifying Modality Heterogeneity


[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

Information transfer: a transfer-learning perspective.

1. Homogeneous Pre-training

2. Heterogeneity-aware Fine-tuning

15 of 45

HighMMT


HighMMT heterogeneity-aware sharing: estimate heterogeneity to determine parameter sharing.

[Figure: the same plot, adding HighMMT with heterogeneity-aware parameter sharing, which gives the best performance vs. model-size trade-off among the single-task and multitask variants.]

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

16 of 45

HighMMT


Transfer across partially observable modalities.

HighMMT: unified model + parameter sharing + multitask and transfer learning.

The multitask HighMMT model, trained on video classification (audio, video), sentiment and emotion recognition (language, video, audio), and robot dynamics prediction (video, time-series), transfers to a new task: disease-code prediction from time-series and table data.

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

17 of 45

HighMMT


Transfer across partially observable modalities.

HighMMT: unified model + parameter sharing + multitask and transfer learning.

Target task: MIMIC. Accuracy vs. number of source tasks (drawn from different modalities, research areas, and tasks):
0 source tasks: 67.7% | 1: 68.3% | 2: 68.5% | 3: 68.5%

Target task: UR-FUNNY. Accuracy vs. number of source tasks (drawn from different modalities, research areas, and tasks):
0 source tasks: 63.3% | 1: 64.1% | 2: 65.5% | 3: 65.7%

HighMMT achieves both multitask and transfer capabilities across modalities and tasks.

[Liang et al., HighMMT: Quantifying Modality & Interaction Heterogeneity for High-Modality Learning. TMLR 2022]

18 of 45


Future Direction: Heterogeneity & Interactions

Homogeneity vs. heterogeneity.

Challenges: moving beyond arbitrary tokenization; interactions beyond the differentiable (causal, logical, brain-inspired); theoretical foundations of interactions.

19 of 45


[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]

Quantifying Interactions

Decompose multimodal information into four interaction types: uniqueness in modality 1 (U1), uniqueness in modality 2 (U2), redundancy (R), and synergy (S).

1. Dataset quantification

2. Model quantification

3. Model selection

20 of 45


[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]

Quantifying Interactions


Classical information theory vs. partial information decomposition (PID): the classical interaction information conflates redundancy and synergy into a single signed quantity, so it can be negative!
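In equations (standard PID identities), PID splits the total multimodal information into the four terms above:

$$I(X_1, X_2; Y) = R + U_1 + U_2 + S, \qquad I(X_1; Y) = R + U_1, \qquad I(X_2; Y) = R + U_2.$$

The classical interaction information collapses redundancy and synergy into one signed quantity, which is exactly why it can be negative:

$$I(X_1; X_2; Y) = I(X_1; Y) + I(X_2; Y) - I(X_1, X_2; Y) = R - S.$$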

21 of 45


[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]

Quantifying Interactions


Marginal-matching distributions: only the pairwise marginals p(x1,y) and p(x2,y) are needed to infer R, U1, and U2; inferring S additionally requires the dependence between x1 and x2 (the full joint).

22 of 45


[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]

Quantifying Interactions


Marginal-matching distributions: if X1, X2, and Y have small, discrete support, the decomposition admits an exact solution via convex programming with linear constraints; for high-dimensional, continuous X1, X2, and Y, an approximate neural-network estimator is used.
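For the small discrete case, that convex program can be written down directly. A minimal sketch with cvxpy (our own code following the marginal-matching formulation above, not the paper's released implementation; R, U1, U2, S are recovered from the optimum via the standard PID identities):

```python
import numpy as np
import cvxpy as cp

def mutual_info(P):
    """I(A;B) in nats for a 2-D joint distribution P[a, b]."""
    prod = P.sum(1, keepdims=True) @ P.sum(0, keepdims=True)  # p(a)p(b)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / prod[mask])))

def pid(P):
    """Exact (R, U1, U2, S) for a small discrete joint P[x1, x2, y]."""
    n1, n2, ny = P.shape
    py = P.sum(axis=(0, 1))                      # p(y), fixed by the marginals
    P1y, P2y = P.sum(axis=1), P.sum(axis=0)      # p(x1,y), p(x2,y)

    Q = cp.Variable((n1 * n2, ny), nonneg=True)  # candidate q(x1, x2, y)
    # Linear marginal-matching constraints: q(x1,y) = p(x1,y), q(x2,y) = p(x2,y).
    A1 = np.kron(np.eye(n1), np.ones((1, n2)))   # sums over x2
    A2 = np.kron(np.ones((1, n1)), np.eye(n2))   # sums over x1
    constraints = [A1 @ Q == P1y, A2 @ Q == P2y]

    # Convex objective: I_q(X1,X2;Y) = KL(q(x1,x2,y) || q(x1,x2) p(y)).
    q12 = cp.reshape(cp.sum(Q, axis=1), (n1 * n2, 1))
    objective = cp.sum(cp.rel_entr(Q, q12 @ py.reshape(1, ny)))
    cp.Problem(cp.Minimize(objective), constraints).solve()

    I1, I2 = mutual_info(P1y), mutual_info(P2y)
    Ip, Iq = mutual_info(P.reshape(n1 * n2, ny)), objective.value
    return I1 + I2 - Iq, Iq - I2, Iq - I1, Ip - Iq  # R, U1, U2, S

# Sanity check: XOR is purely synergistic (R = U1 = U2 = 0, S = log 2).
P = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        P[a, b, a ^ b] = 0.25
print(pid(P))
```

Minimizing I_q(X1,X2;Y) over all q matching the two pairwise marginals removes exactly the synergistic part, since only the x1-x2 dependence is left free.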

23 of 45


[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]

Model Selection


Interaction polytope

1. Dataset quantification: can be done with synthetic data.

24 of 45

[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]

Model Selection


Interaction polytope

2. Model quantification: model families trained on the synthetic data.

25 of 45

[Liang et al., Quantifying & Modeling Feature Interactions: An Information Decomposition Framework. arXiv 2023]

Model Selection


Interaction polytope

3. Model selection: selects models achieving >96% of the best-performing model's accuracy.

26 of 45


Future Direction: Long-term

From short-term interactions (seconds or minutes) to long-term ones.

Challenges: compositionality, memory, personalization.

27 of 45


Future Direction: Interaction

From perception, reasoning, and generation toward social intelligence in multimodal interaction (e.g., Social-IQ).

Challenges: multi-party interaction, generation, ethics.

28 of 45


Future Direction: Real-world

Applications: healthcare decision support, intelligent interfaces and vehicles, online learning and education.

Challenges: fairness, robustness, generalization, interpretation (MultiViz).

29 of 45


MultiViz: Visualizing & Interpreting Multimodal Models

How can we understand the modeling of heterogeneity and interconnections, and gain insights into a model's internal mechanics for safe real-world deployment?

30 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


Is there a red shape above a circle?

Yes!

[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]

How can we understand the modeling of heterogeneity and interconnections, and gain insights for safe real-world deployment?

31 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


Is there a red shape above a circle?

Yes!


How can we understand the modeling of heterogeneity and interconnections, and gain insights for safe real-world deployment?

[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]

32 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


Unimodal importance: Does the model correctly identify keywords in the question?

Is there a red shape above a circle?

Yes!

1. Unimodal importance


[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]
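Stage 1 is ordinary first-order attribution. A gradient-x-input sketch (one common choice; our illustration rather than MultiViz's exact procedure):

```python
import torch

def unimodal_importance(model, x, answer_idx):
    """Per-feature importance of one modality's input, via gradient x input."""
    x = x.detach().requires_grad_(True)
    model(x)[answer_idx].backward()   # score of the predicted answer, e.g. "Yes"
    return (x * x.grad).detach()      # large magnitude = influential feature
```

Applied to the question embedding, this should light up "red", "above", and "circle".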

33 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


Cross-modal interactions: Does the model correctly relate the question with the image?

Is there a red shape above a circle?

Yes!

1. Unimodal importance

2. Cross-modal interactions

34 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


Multimodal representations: Does the model consistently assign concepts to features?

Is there a red shape above a circle?

Yes!

1. Unimodal importance

2. Cross-modal interactions

3. Multimodal representations (e.g., features for "red" and "circle")

35 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


Multimodal prediction: Does the model correctly compose question and image information?

Is there a red shape above a circle?

Yes!

1. Unimodal importance

2. Cross-modal interactions

3. Multimodal representations

4. Multimodal prediction: composing the evidence for "red", "circle", and "above" into the final answer "Yes!"

36 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


How can we interpret cross-modal interactions in multimodal models?

2. Cross-modal interactions

Is there a red shape above a circle?

Natural second-order extension of gradient-based approaches!

Statistical non-additive interactions [Friedman & Popescu, 2008; Sorokina et al., 2008]

Also related: EMAP [Hessel et al., 2020], DIME [Lyu et al., 2022]
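Concretely, the second-order view treats the interaction between a text feature and an image feature as the mixed second derivative of the model output. A minimal autograd sketch (our illustration of the idea, not MultiViz's exact procedure):

```python
import torch

def cross_modal_interaction(f, x_text, x_image):
    """Mixed second derivatives d2f / (dx_text dx_image) as an interaction map."""
    x_text = x_text.detach().requires_grad_(True)
    x_image = x_image.detach().requires_grad_(True)
    g_text, = torch.autograd.grad(f(x_text, x_image), x_text, create_graph=True)
    rows = [torch.autograd.grad(g, x_image, retain_graph=True)[0]
            for g in g_text.flatten()]
    return torch.stack(rows)  # rows index text features, columns image features

# Sanity check on a bilinear toy model, whose true interactions are W[i, j].
W = torch.randn(3, 4)
f = lambda t, v: t @ W @ v
M = cross_modal_interaction(f, torch.randn(3), torch.randn(4))
print(torch.allclose(M, W))  # True
```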

37 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


How can we interpret cross-modal interactions in multimodal models?

CLEVR: "The other small shiny thing that is the same shape as the tiny yellow shiny object is what color?"

VQA 2.0: "How many birds?"

Flickr-30k (correspondences): "Three small dogs, two white and one black and white, on a sidewalk."

CMU-MOSEI (relationships): "Why am I spending my money watching this? (sigh) I think I was more sad."

[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]

38 of 45

MultiViz: Visualizing & Interpreting Multimodal Models


How can we understand multimodal representations?

3. Multimodal representations

Local analysis: "What color is the tie of the second man to the left?" → Blue

Global analysis (a "color" feature): "What color is the Salisbury Rd sign?" / "What color is the building?" / "What color are the checkers on the wall?"

39 of 45

Evaluating Interpretability


How can we evaluate the success of interpreting internal mechanics?

Problem: real-world datasets and models come with no annotations of unimodal importance, cross-modal interactions, or representations!

Stages: unimodal importance → cross-modal interactions → multimodal representations → multimodal prediction.

[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]

40 of 45

Evaluating Interpretability


How can we evaluate the success of interpreting internal mechanics?

1. Model simulation: Can humans reproduce model predictions with high accuracy and agreement?

Stages: unimodal importance → cross-modal interactions → multimodal representations → multimodal prediction → open challenges.

[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]

41 of 45

Evaluating Interpretability


How can we evaluate the success of interpreting internal mechanics?

Human simulation accuracy by interpretation stages shown (U = unimodal importance, C = cross-modal interactions, R = representations, P = prediction):

U: 55.0%
U + C: 65.0%
U + C + Local R: 61.7%
U + C + Local R + Global R: 71.7%
U + C + Local R + Global R + P: 81.7%

Adding MultiViz stages leads to higher accuracy and agreement.

[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]

42 of 45

Evaluating Interpretability


How can we evaluate the success of interpreting internal mechanics?

2. Model debugging: Can humans find bugs in the model for improvement? (find bugs → fix bugs)

Stages: unimodal importance → cross-modal interactions → multimodal representations → multimodal prediction → open challenges.

[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]

43 of 45

Evaluating Interpretability


How can we understand multimodal representations?

3. Multimodal representations

Local analysis: "What color is the tie of the second man to the left?" → Blue

Global analysis (a "color" feature): "What color is the Salisbury Rd sign?" / "What color is the building?" / "What color are the checkers on the wall?"

"Models pick up cross-modal interactions but fail in identifying color!"

44 of 45

Evaluating Interpretability


How can we evaluate the success of interpreting internal mechanics?

Accuracy improvement from retraining with targeted examples involving color, selected by: random sampling +1.4%; uncertainty-based sampling +0.2%; MultiViz-guided selection +30.5%.

MultiViz enables error analysis and debugging of multimodal models: "Models pick up cross-modal interactions but fail in identifying color!" → add targeted examples involving color.

Side note: we used this to discover a bug in a popular deep learning code repository.

[Liang et al., MultiViz: Towards Visualizing and Understanding Multimodal Models. ICLR 2023, CHI 2023 Late Breaking Work]

45 of 45


What is Multimodal?

Heterogeneous

Connected

Interacting

Why is it hard?

What is next?

Representation

Alignment

Reasoning

Generation

Transference

Quantification

High-modality

Heterogeneity

Long-term

Interaction

Real-world

Liang, Zadeh, and Morency. Foundations and Trends in Multimodal Machine Learning. 2022