Multimodal Learning Part 2: Alignment, Transformers, Pre-training

Paul Pu Liang

Machine Learning Department

Carnegie Mellon University

What is Multimodal?

Key properties: heterogeneous, connected, and interacting.

Why is it hard? What is next?

Multimodal machine learning is the scientific study of heterogeneous and interconnected data ☺

Core Multimodal Challenges

  • Representation
  • Alignment
  • Transference
  • Generation
  • Quantification
  • Reasoning

Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions between individual elements, across different modalities.

Sub-challenges:

  • Fusion: # modalities > # representations
  • Coordination: # modalities = # representations
  • Fission: # modalities < # representations

[Liang, Zadeh, and Morency. Foundations and Recent Trends on Multimodal Machine Learning. arXiv 2022]

Objectives of today’s class

  • Discrete and continuous alignment
  • Contextualized representations
    • Recap: transformers and pre-training
    • Multimodal transformers and pretraining
    • Conditioning language-only models

Challenge 2: Alignment

Definition: Identifying and modeling cross-modal connections between all elements of multiple modalities, building from the data structure.

Sub-challenges:

  • Discrete alignment: discrete elements and connections
  • Continuous alignment: segmentation and continuous warping
  • Contextualized representation: alignment + representation

Discrete & continuous alignment

Sub-Challenge 2a: Discrete Alignment

Definition: Identify and model connections between elements of multiple modalities.

Two axes: local vs. global alignment, and undirected vs. directed connections.

Connections

Why should 2 elements be connected?

  • Statistical
    • Association: e.g., correlation, co-occurrence
    • Dependency: e.g., causal, temporal
  • Semantic
    • Correspondence: e.g., grounding (an image region = "laptop")
    • Relationship: e.g., function (a laptop is used for …)

Language Grounding

Definition: Tying language (words, phrases, …) to non-linguistic elements, such as the visual world (objects, people, …).

Example: grounding the caption "A woman reading newspaper" to the corresponding image regions. The same statistical (association, dependency) and semantic (correspondence, relationship) connection types apply.

Local Alignment – Coordinated Representations

Example: "A woman reading newspaper" paired with an image.

Supervision: paired data across modality A (visual) and modality B (language). Each modality is passed through its own encoder, and the elements (image regions 1-4, words 1-4) are brought into a coordinated space.

Learning coordinated representations: maximize a similarity function between paired elements so that the representations capture their common information,

$$\max_{f_A,\, f_B} \ \mathrm{sim}\big(f_A(x_A),\, f_B(x_B)\big),$$

or contrastive learning, which additionally pushes unpaired (negative) elements apart.
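To make the contrastive option concrete, here is a minimal NumPy sketch of a CLIP-style symmetric InfoNCE objective over a batch of paired embeddings; the function name, the temperature value, and the assumption of dense encoder outputs are illustrative, not taken from a specific paper above.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)          # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric contrastive loss: paired rows (z_a[i], z_b[i]) are
    pulled together, all unpaired rows are pushed apart."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature               # pairwise similarities
    idx = np.arange(len(z_a))                        # positives: the diagonal
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_ab + loss_ba) / 2
```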

Directed Alignment

Example: "A woman is throwing a frisbee" (query) attending to image regions (keys). Which object does each word refer to?

Attention from modality A to modality B comes in two flavors:

  1. Soft attention: a weighted combination over all elements.
  2. Hard attention: a discrete selection of a single element.

Directed Alignment – Image Captioning

Should we always use the final layer of the CNN for all generated words?

Each generated word ("A", "woman", "is", "throwing", …) acts as a query; the CNN's spatial features act as keys.

Directed Alignment – Image Captioning

At each step $t$, the decoder produces a distribution $\alpha_t$ over the $L$ spatial locations of the CNN feature map $\{a_1, \dots, a_L\}$, $a_i \in \mathbb{R}^D$, and takes an expectation over features:

$$\hat{z}_t = \sum_{i=1}^{L} \alpha_{t,i}\, a_i \in \mathbb{R}^D$$

This context vector conditions the generation of each output word, from the first word onward.

Attention Gates

Soft attention computes the expectation of the context (a fancy way to say it's a weighted average):

$$\hat{z}_t = \mathbb{E}_{i \sim \alpha_t}\big[a_i\big] = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$$
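A minimal NumPy sketch of this weighted average; the dot-product scoring used here is an illustrative simplification (Xu et al. score locations with a small MLP over the decoder state).

```python
import numpy as np

def soft_attention(query, features):
    """Soft attention as an expectation over CNN feature locations.

    query:    (D,)   current decoder state
    features: (L, D) one feature vector per spatial location
    """
    scores = features @ query                    # (L,) alignment scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                         # softmax over L locations
    context = alpha @ features                   # weighted average, shape (D,)
    return context, alpha
```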

Example – Image Captioning

[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015]

Global Alignment

Modality A (visual) and modality B (language) are each encoded into sets of representations. A coordination function then scores the global alignment between the two sets, and we jointly optimize representation + global alignment. Crucially, no pairing information is required.
Assignment Problem

Set it up as a bipartite graph between the elements of modality A and modality B, with similarity weights $s_{ij}$ on the edges.

Initial assumptions:

  • Same number of elements in A and B modalities
  • 1-to-1 "hard" alignment between elements
  • All elements assigned (aka "perfect matching")

The assignment is a permutation $\pi$ (a vector of indices). Maximize:

$$\max_{\pi} \ \sum_{i} s_{i,\pi(i)}$$

How to solve? The naive solution checks all $n!$ assignments. A better solution is Linear Programming:

$$\max_{a \ge 0} \ \sum_{i,j} s_{ij}\, a_{ij} \quad \text{s.t.} \ \sum_{j} a_{ij} = 1, \ \sum_{i} a_{ij} = 1,$$

which can be solved with the simplex algorithm.
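In practice the LP is rarely written out by hand: SciPy ships a Hungarian-algorithm solver for exactly this 1-to-1 assignment. A minimal sketch (the random similarity matrix is just illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
S = rng.random((4, 4))        # similarities between 4 elements of A and of B

# Optimal 1-to-1 "hard" alignment in polynomial time,
# instead of checking all 4! candidate assignments.
rows, cols = linear_sum_assignment(S, maximize=True)
print(list(zip(rows, cols)), S[rows, cols].sum())
```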

Optimal transport

Again a bipartite graph with similarity weights $s_{ij}$, but with new assumptions:

  • Different number of elements in A and B modalities
  • Many-to-many "soft" alignment between elements

The assignments become a nonnegative matrix $T$ whose rows and columns sum to the element masses $\mu$ and $\nu$. Maximize:

$$\max_{T \ge 0} \ \sum_{i,j} s_{ij}\, T_{ij} \quad \text{s.t.} \ T\mathbf{1} = \mu, \ T^\top\mathbf{1} = \nu$$

It can be seen as "transporting" elements from modality A to modality B (and vice-versa). The Wasserstein distance is the cost of this optimal transport plan.
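In deep learning pipelines this relaxed problem is often solved with entropy regularization via Sinkhorn iterations, which are differentiable and GPU-friendly. A minimal sketch, where eps and the iteration count are illustrative hyperparameters:

```python
import numpy as np

def sinkhorn(S, mu, nu, eps=0.05, n_iters=200):
    """Entropy-regularized optimal transport between element masses.

    S: (n, m) similarities; mu: (n,) and nu: (m,) marginals (sum to 1).
    Returns a soft assignment T with T @ 1 ~= mu and T.T @ 1 ~= nu.
    """
    K = np.exp(S / eps)                  # similarity kernel (maximization form)
    u = np.ones_like(mu)
    for _ in range(n_iters):             # alternate row/column rescaling
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]   # T = diag(u) K diag(v)
```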

Challenge 2b: Continuous Alignment

Definition: Model alignment between modalities with continuous signals and no explicit elements

Two approaches:

  • Continuous warping
  • Discretization (segmentation)

Continuous Warping – Example

Aligning video sequences

Dynamic Time Warping (DTW)

Given two sequences $X = (x_1, \dots, x_{n_x})$ and $Y = (y_1, \dots, y_{n_y})$, find index vectors $p$ and $q$ of a common length that minimize the alignment difference:

$$\min_{p,\, q} \ \sum_{t} \big\| x_{p_t} - y_{q_t} \big\|_2^2$$

Dynamic Time Warping is designed to find these index vectors!

Equivalently, DTW finds the lowest cost path in a cost matrix $C_{ij} = \|x_i - y_j\|^2$.

  • Restrictions?
    • Monotonicity – no going back in time
    • Continuity – no gaps
    • Boundary conditions – start and end at the same points
    • Warping window – don't get too far from diagonal
    • Slope constraint – do not insert or skip too much

Solved using dynamic programming while respecting the restrictions
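A minimal dynamic-programming sketch; monotonicity, continuity, and boundary conditions are built into the recursion, while warping-window and slope constraints are omitted for brevity.

```python
import numpy as np

def dtw(X, Y):
    """Minimal DTW: cumulative cost of the best monotonic alignment path.

    X: (n, d) and Y: (m, d) sequences of feature vectors.
    """
    n, m = len(X), len(Y)
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0                                 # boundary: start together
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((X[i - 1] - Y[j - 1]) ** 2)
            # allowed steps enforce monotonicity and continuity
            C[i, j] = cost + min(C[i - 1, j], C[i, j - 1], C[i - 1, j - 1])
    return C[n, m]                                # boundary: end together
```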

DTW alternative formulation

Replication doesn't change the objective!

Alternative objective, using binary replication matrices $W_x \in \{0,1\}^{n_x \times t}$ and $W_y \in \{0,1\}^{n_y \times t}$ that repeat sequence elements so both warped sequences share a common length $t$:

$$\min_{W_x,\, W_y} \ \big\| X W_x - Y W_y \big\|_F^2$$

A differentiable version of DTW (soft-DTW) also exists…

https://arxiv.org/pdf/1703.01541.pdf

Discretization (aka Segmentation)

Common assumption: the continuous signal can be segmented into discrete elements.

Examples: images (segmenting objects), signals, medical imaging.
Discretization – Example

t ah m aa t ow

Spectrogram

Phonemes

How can we predict the sequence of phoneme labels?

Sequence Labeling and Alignment

Challenge: many-to-1 alignment. Many spectrogram frames map to a single phoneme label (e.g., the frames spanning t ah m aa).
Discretization – A Classification Approach

Connectionist Temporal Classification (CTC)

  1. A softmax over every frame gives the output activations: a distribution over phoneme labels, plus an extra output ∅ for 'blank' or no label.
  2. Decode the most probable sequence of frame labels.
  3. Collapse the frame labels (merge repeats, then remove blanks) to obtain the phoneme sequence: t ah m aa t ow.

Graves et al., Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, ICML 2006
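A sketch of the collapse rule used in greedy (best-path) CTC decoding; the label strings simply mirror the running example:

```python
def ctc_collapse(frame_labels, blank="-"):
    """Merge repeated per-frame labels, then drop blanks (many-to-1)."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

frames = ["t", "t", "ah", "-", "m", "m", "aa", "-", "t", "ow"]
print(ctc_collapse(frames))   # ['t', 'ah', 'm', 'aa', 't', 'ow']
```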

Discretization and Representation – Cluster-based Approaches

HuBERT: Hidden-Unit BERT

  • K-means clustering of speech features produces discrete "hidden units" that serve as pseudo-labels.
  • A self-attention Transformer runs over the (partially masked) speech input and, through a linear projection, is trained to predict the cluster assignments of the masked frames.

Hsu et al., HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, arXiv 2021
Contextualized (aligned) representations

Self-Attention

Figure: a self-attention module takes the word embeddings of "I do not like it" and outputs, for each word, a contextualized embedding computed as a weighted sum (Σ) over all the words.

Transformer Self-Attention

Each word embedding of "I do not like it" is projected into a Query, a Key, and a Value vector. Scaled dot-products between queries and keys give the attention weights, and each output is the weighted sum (Σ) of the value vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
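A minimal single-head NumPy sketch of this computation; the projection matrices would be learned, and are just function arguments here.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    X: (seq_len, d_model), e.g. embeddings of "I do not like it".
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot-products
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V                                     # Σ over value vectors
```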

Stacking this across all positions gives the Transformer's Self-Attention Layer: the sentence "I do not like it" goes in, and one contextualized embedding per word comes out.
Transformer Multi-Head Self-Attention

Several self-attention layers (heads) run in parallel over "I do not like it", each with its own Query/Key/Value projections. Their outputs are concatenated and combined by a linear projection:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
Sequence-to-Sequence Modeling

How can we perform seq2seq translation with transformer attention? For example, translating "I do not like it" into "Je n'aime pas cela".
Seq2Seq with Transformer Attentions

Step 1: an "encoder" transformer applies self-attention over the source sentence "I do not like it".

Step 2: a "decoder" transformer applies "masked" self-attention over the shifted target sequence (START, Je, n', aime, pas): each position may only attend to earlier positions, keeping generation autoregressive.

Step 3: how should we connect the encoder and decoder self-attention? Through an "encoder-decoder" transformer attention layer, where the Queries come from the decoder states and the Keys and Values come from the encoder outputs. The decoder then predicts the target words (Je, n', aime, pas, cela).
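The "masked" decoder self-attention is ordinary self-attention with future positions blocked out; a minimal sketch of the mask:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to j <= i,
    so the decoder never peeks at future target words."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Usage: set blocked scores to -inf before the softmax, e.g.
# scores = np.where(causal_mask(len(scores)), scores, -np.inf)
```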

Token-level and Sentence-level Embeddings

Token-level embeddings: the transformer outputs one contextualized embedding per input token ("I", "do", "not", "like", "it").

Sentence-level embedding: a single vector summarizing the whole sequence, e.g., obtained by pooling the token embeddings or reading out a special token.
Pre-Training and Fine-Tuning

Pre-training: train the transformer on a generic objective (e.g., a language model) over large unlabeled data.

Fine-tuning: initialize the parameters from the pre-trained model, then train on the downstream task.
Fine-tuning Approaches

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

Vision Transformer (ViT)

https://arxiv.org/abs/2010.11929

Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv (2020).

The image is split into 16x16 patches; each patch is flattened and linearly embedded into a token. The transformer then produces per-patch embeddings plus an embedding for the whole image (a special classification token).
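A minimal sketch of the patchify step; the subsequent learned linear embedding and the position embeddings are omitted.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened 16x16 patch vectors.

    Returns (num_patches, patch * patch * C); a learned linear layer
    would then map each row to the transformer embedding size.
    """
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

print(patchify(np.zeros((224, 224, 3))).shape)   # (196, 768): 14x14 tokens
```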

Masked Auto-Encoder (MAE)

He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022

A Vision Transformer (ViT) encoder sees only the visible patches after a random subset (~75%) of them is masked. A lightweight transformer decoder, only used during pre-training, reconstructs the image, trained with a reconstruction loss function over the whole image.
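A sketch of the random masking step: only the visible tokens are fed to the encoder, and the masked indices tell the decoder what to reconstruct.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, seed=0):
    """Split patch tokens into a visible subset and masked indices."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep, masked = np.sort(perm[:n_keep]), np.sort(perm[n_keep:])
    return tokens[keep], masked
```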

Multimodal Transformers

Multimodal Embeddings

How to learn contextualized representations from multiple modalities, e.g., language ("I really liked it this time"), visual, and acoustic?

Option 1: Concatenate modalities and learn a BERT-style transformer over the combined sequence.
Simple Solution: Contextualized Multimodal Embeddings

Transformer self-attention runs over the concatenated language ("I really liked it this time"), visual, and acoustic sequences, so every element can attend to every other element, both within and across modalities.
VisualBERT

https://arxiv.org/abs/1908.03557

Li, Liunian Harold, et al. "Visualbert: A simple and performant baseline for vision and language." arXiv (2019).

UNITER

https://arxiv.org/abs/1909.11740v1

Chen, Yen-Chun, et al. "Uniter: Universal image-text representation learning." European conference on computer vision. 2020.

Similar Transformer architecture to BERT and VisualBERT, but with slightly different optimization: masking words, masking image regions, and image-text matching (does the sentence match the image?).
Visual-and-Language Transformer (ViLT)

https://arxiv.org/abs/2102.03334

 

Uses optimal transport (the global alignment formulation from earlier) between word and image-patch embeddings as one of its objectives, giving explicit examples of alignment between modalities.
Multimodal Embeddings

How to learn contextualized representations from multiple modalities ("I really liked it this time" + visual + acoustic)?

Option 2: Look at pairwise interactions between modalities.
Multimodal Transformer – Pairwise Cross-Modal

Unimodal representations (visual, vocal, verbal: "I like…") feed into cross-modal attention blocks, stacked for N layers, yielding cross-modally contextualized unimodal representations that are finally combined into a multimodal representation.
Visually contextualizing the verbal modality: for each word (e.g., "spectacle" in "I like…"), compute similarities (attentions) between the word embedding and the visual embeddings across time, take the correlated visual information as a weighted sum, and add it to the word embedding through a residual connection. This gives a "new visually-contextualized representation of language."
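A minimal sketch of this cross-modal attention block in the style of the Multimodal Transformer; square projection matrices are assumed here so the residual addition type-checks.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)

def visual_to_verbal(Z_lang, Z_vis, Wq, Wk, Wv):
    """Queries from language, keys/values from vision, plus a residual.

    Z_lang: (n_words, d), Z_vis: (n_frames, d), Wq/Wk/Wv: (d, d).
    """
    Q, K, V = Z_lang @ Wq, Z_vis @ Wk, Z_vis @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # word-to-frame similarities
    return Z_lang + attn @ V                         # residual connection
```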

Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019

ViLBERT

Lu, Jiasen, et al. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." arXiv (August 6, 2019).

Two unimodal transformer streams (vision and language) connected by cross-modal transformer (co-attention) modules.
LXMERT

https://arxiv.org/abs/1908.07490v1

Tan, Hao, and Mohit Bansal. "Lxmert: Learning cross-modality encoder representations from transformers." arXiv (August 20, 2019).

Unimodal transformers followed by cross-modal transformer modules. Why this two-stage design?
Video-based Representation and Alignment

https://www.di.ens.fr/willow/research/howto100m/

HowTo100M benchmark dataset

Visual Representations from Uncurated Instructional Videos

End-to-End Learning of Visual Representations from Uncurated Instructional Videos. Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. CVPR 2020

Goal: Learn better visual representations by taking advantage of large-scale video+language resources: instructional videos (weakly-paired data), with narrations such as "it's turning into a much thicker mixture" and "The biggest mistake is not kneading it enough".
Another Approach for Weakly-Paired Video Data

https://arxiv.org/abs/1904.01766

How do we get visual "words" now? K-means clustering of video features: each centroid becomes a discrete visual token.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid. VideoBERT: A Joint Model for Video and Language Representation Learning. ICCV 2019
ActBERT

Zhu and Yang, ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020

ALBEF: Align Before Fusion

Li et al., Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, NeurIPS 2021

Contrastive alignment of segment-level (image and text) embeddings before fusion. The objective is related to mutual information maximization and encourages close alignment between the modalities.
Conditioning language models

Autoregressive Models

Autoregressive language models

[Brown et al., Language Models are Few-shot Learners. NeurIPS 2020]
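These models all train on the standard chain-rule factorization of the sequence distribution into next-token conditionals:

```latex
p(x) \;=\; \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots, x_{t-1}\right)
```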

Autoregressive Models

Autoregressive audio generation models

[van den Oord et al., WaveNet: A Generative Model for Raw Audio. arXiv 2016]

Conditioning Autoregressive Models

We typically want p(x|c) - conditional generation

  • c is a category (e.g. faces, outdoor scenes) from which we want to generate images
  • c is an image which we want to describe in natural language

We might also care about p(x2|x1,c) - style transfer

  • c is a stylistic change e.g. negative to positive

Conditioning Autoregressive Models

[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]

Conditioning via prefix tuning

Modeling p(x|c): a frozen pretrained language model on its own models p(x). A trained vision adapter maps the image into prefix tokens, so the adapted + pretrained system models p(x|c), e.g., generating the caption "A small red boat on the water."
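A schematic sketch of the prefix idea; the single linear adapter, the shapes, and the names are illustrative assumptions (Frozen trains a full vision encoder through the frozen LM).

```python
import numpy as np

def prefix_condition(image_feats, text_embeds, W_adapter):
    """Map image features to "visual tokens" and prepend them to the text.

    image_feats: (k, d_img); text_embeds: (t, d_model);
    W_adapter:   (d_img, d_model), the only trained weights.
    """
    visual_tokens = image_feats @ W_adapter          # (k, d_model) prefix
    return np.concatenate([visual_tokens, text_embeds], axis=0)

# The frozen LM then continues this combined sequence autoregressively,
# which is exactly modeling p(x | c) with c supplied as a prefix.
```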

The same frozen LM + adapter setup supports, without updating the LM:

  • 0-shot VQA: "What color is the car?" → "Blue"
  • 1-shot outside-knowledge VQA: e.g., answering "Steve Jobs"; this recall reasoning leverages the implicit knowledge in LMs
  • Few-shot image classification: binding new words to new visual categories, e.g., "This is a dax."
Conditioning Autoregressive Models

[Zhu et al., MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models. 2023]

Conditioning via prefix tuning

MiniGPT-4

Stage 1: Alignment using paired image-text data.

Stage 2: Instruction tuning using image + text instructions and example completions.

Reminder: Modality-Shifting Fusion

A primary modality is adjusted by secondary modalities: a gate computes a shift vector from the secondary modalities, which is added to the primary representation.

Wang et al., Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors, AAAI 2019

Example with the language modality: the primary modality is language; the secondary modalities are acoustic and visual. For a word like "expectations", the nonverbal context yields either a negative-shifted or a positive-shifted representation.
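A schematic sketch of the gate + shift computation; the weight shapes and the scaling factor are illustrative assumptions (the published models condition the gate on both the word and the nonverbal features).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modality_shift(h_word, h_nonverbal, W_gate, W_shift, beta=0.5):
    """Shift a word representation using secondary (nonverbal) modalities.

    h_word: (d,); h_nonverbal: (d_nv,); W_gate, W_shift: (d_nv, d).
    """
    gate = sigmoid(h_nonverbal @ W_gate)      # per-dimension gating in (0, 1)
    shift = gate * (h_nonverbal @ W_shift)    # gated displacement vector
    return h_word + beta * shift              # negative- or positive-shifted
```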

Modality-Shifting with Transformers

A Multimodal Adaptation Gate (MAG) module is inserted between the transformer self-attention layers of a pretrained BERT.

Rahman et al., Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020

Multimodal Adaptation Gate (MAG) + BERT

Recap

  • Discrete and continuous alignment
  • Contextualized representations
    • Recap: transformers and pre-training
    • Multimodal transformers and pretraining
    • Conditioning language-only models
