Multimodal Learning Part 2: Alignment, Transformers, Pre-training
Paul Pu Liang
Machine Learning Department
Carnegie Mellon University
What is Multimodal?
Heterogeneous
Connected
Interacting
Why is it hard?
What is next?
Multimodal machine learning is the scientific study of heterogeneous and interconnected data ☺
Core Multimodal Challenges
Representation
Alignment
Transference
Generation
Quantification
Reasoning
Challenge 1: Representation
Definition: Learning representations that reflect cross-modal interactions between individual elements, across different modalities.
Sub-challenges:
- Fusion: # modalities > # representations
- Coordination: # modalities = # representations
- Fission: # modalities < # representations
[Liang, Zadeh, and Morency. Foundations and Recent Trends on Multimodal Machine Learning. arXiv 2022]
Objectives of today’s class
Challenge 2: Alignment
Definition: Identifying and modeling cross-modal connections between all elements of multiple modalities, building from the data structure.
Sub-challenges:
- Discrete alignment: discrete elements and connections
- Continuous alignment: segmentation and continuous warping
- Contextualized representation: alignment + representation
Discrete & continuous alignment
Sub-Challenge 2a: Discrete Alignment
Definition: Identify and model connections between elements of multiple modalities.
Dimensions: local vs. global alignment; undirected vs. directed connections.
Connections
Why should 2 elements be connected?
Statistical:
- Association: e.g., correlation, co-occurrence
- Dependency: e.g., causal, temporal
Semantic:
- Correspondence: e.g., grounding (the word "laptop" = a laptop region in the image)
- Relationship: e.g., function (a laptop is used for …)
Language Grounding
Definition: Tying language (words, phrases, …) to non-linguistic elements, such as the visual world (objects, people, …).
Example: "A woman reading a newspaper", with words and phrases grounded to image regions. [Figure repeats the taxonomy of connections above: association, dependency, correspondence, relationship.]
Local Alignment – Coordinated Representations
Example: "A woman reading a newspaper". Supervision: paired data, where element i in the visual modality (modality A) corresponds to element i in language (modality B).
[Figure: elements 1-4 of each modality pass through their own encoder.]
Learning coordinated representations: maximize a similarity function capturing the common information between paired elements, or use contrastive learning.
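A minimal sketch of the contrastive option, in the style of an InfoNCE/CLIP objective (the function and variable names are illustrative, not from the lecture):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(za, zb, temperature=0.07):
    """InfoNCE-style loss for coordinated representations.

    za, zb: (batch, dim) embeddings from the two modality encoders;
    row i of za is paired with row i of zb.
    """
    za = F.normalize(za, dim=-1)
    zb = F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature   # pairwise similarity matrix
    targets = torch.arange(za.size(0), device=za.device)  # positives on the diagonal
    # Symmetric loss: align A -> B and B -> A
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```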
Directed Alignment
Example: "A woman is throwing a frisbee" – which object does "frisbee" refer to?
Attention: elements of modality A act as queries over elements of modality B (keys).
1. Soft attention: a weighted combination over all elements.
2. Hard attention: select a single element.
Directed Alignment – Image Captioning
Should we always use the final layer of the CNN for all generated words?
[Figure: decoder states for each generated word ("A", "woman", "is", "throwing") act as queries over CNN feature locations (keys).]
Directed Alignment – Image Captioning
The attention weights form a distribution over the L spatial locations of the CNN feature map; the context vector is the expectation over the D-dimensional features under that distribution, recomputed for every output word starting from the first.
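A minimal sketch of this soft attention step, using additive (MLP) scoring in the spirit of Show, Attend and Tell (the parameter names w_f, w_q, v are illustrative):

```python
import torch
import torch.nn.functional as F

def soft_visual_attention(features, query, w_f, w_q, v):
    """Soft attention over CNN feature locations (a sketch).

    features: (L, D) CNN features at L spatial locations
    query:    (H,)   current decoder hidden state
    w_f: (D, A), w_q: (H, A), v: (A,) learned projections
    """
    scores = torch.tanh(features @ w_f + query @ w_q) @ v  # (L,)
    alpha = F.softmax(scores, dim=0)   # distribution over the L locations
    context = alpha @ features         # expectation over features: (D,)
    return context, alpha
```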
Attention Gates
The attention-weighted context vector is the expectation of the context (a fancy way of saying it's a weighted average).
Example – Image Captioning
[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015]
Global Alignment
Setting: no pairing information between the elements of modality A (visual) and modality B (language).
[Figure: each modality's elements pass through an encoder to a representation; a coordination function scores the global alignment.]
Jointly optimize representation + global alignment.
Assignment Problem
Bipartite graph between the elements of the two modalities.
Initial assumptions: both modalities have the same number of elements, and each element of modality A is matched to exactly one element of modality B.
Given similarity weights $s_{ij}$, find an assignment $\pi$ (a vector of indices) that maximizes $\sum_i s_{i\pi(i)}$.
How to solve? Naive solution: check all assignments (factorially many). Better solution: linear programming; the relaxed problem can be solved with the simplex algorithm.
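In practice the assignment problem is usually solved with the Hungarian algorithm rather than a generic LP solver; a minimal example with SciPy (the similarity matrix below is made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Similarity weights between 4 elements of modality A (rows)
# and 4 elements of modality B (columns).
S = np.array([[0.9, 0.1, 0.2, 0.0],
              [0.3, 0.8, 0.1, 0.2],
              [0.0, 0.2, 0.7, 0.4],
              [0.1, 0.0, 0.3, 0.9]])

rows, cols = linear_sum_assignment(S, maximize=True)  # Hungarian algorithm
print(list(zip(rows, cols)))   # optimal one-to-one assignment
print(S[rows, cols].sum())     # total similarity of that assignment
```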
Optimal transport
Bipartite graph, with new assumptions: elements carry probability mass, and assignments can be soft; each element of modality A may be fractionally matched to several elements of modality B.
Given similarity weights $s_{ij}$, find a transport plan $T \ge 0$ (with row and column sums matching the two modalities' distributions) that maximizes $\sum_{ij} T_{ij} s_{ij}$.
It can be seen as "transporting" elements from modality A to modality B (and vice versa); written as a minimization over costs, the optimal value gives the Wasserstein distance of optimal transport.
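Entropy-regularized optimal transport can be computed with Sinkhorn iterations; a minimal NumPy sketch (assuming a similarity matrix S and marginals a, b; the regularizer eps is illustrative):

```python
import numpy as np

def sinkhorn_transport(S, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT via Sinkhorn iterations (a sketch).

    S: (n, m) similarity matrix; a: (n,), b: (m,) marginal distributions.
    Returns a soft assignment T >= 0 with T.sum(1) ~ a and T.sum(0) ~ b.
    """
    K = np.exp(S / eps)          # similarities -> positive kernel
    u = np.ones_like(a)
    for _ in range(n_iters):     # alternately match the two marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]
```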
Challenge 2b: Continuous Alignment
Definition: Model alignment between modalities with continuous signals and no explicit elements
Two approaches: continuous warping, and discretization (segmentation).
Continuous Warping – Example
Aligning video sequences
Dynamic Time Warping (DTW)
Find index vectors $\mathbf{p}$ and $\mathbf{q}$ (one per sequence) that minimize the alignment difference $\sum_{t} \lVert x_{p_t} - y_{q_t} \rVert^2$, subject to monotonicity and boundary constraints. Dynamic Time Warping is designed to find these index vectors!
Dynamic Time Warping (DTW)
DTW finds the lowest-cost path in a cost matrix, solved using dynamic programming while respecting the monotonicity and continuity restrictions.
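A minimal sketch of that dynamic program (quadratic time, with a Euclidean local cost as one common choice):

```python
import numpy as np

def dtw(x, y):
    """Dynamic Time Warping cost via dynamic programming (a sketch).

    x: (n, d), y: (m, d): two continuous sequences to align.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            # allowed moves: match, repeat a frame of x, repeat a frame of y
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```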
DTW alternative formulation
Replicating frames doesn't change the objective! Alternative objective: find binary replication (warping) matrices $W_x, W_y$ minimizing $\lVert X W_x - Y W_y \rVert_F^2$. A differentiable version of DTW (soft-DTW) also exists…
Discretization (aka Segmentation)
Common assumption: the continuous signal can be segmented into discrete elements (e.g., objects).
Examples: images (segmented objects), signals, medical imaging.
Discretization – Example
Spectrogram → phonemes: "t ah m aa t ow".
How can we predict the sequence of phoneme labels? Sequence labeling and alignment.
Discretization – Example
Spectrogram → phonemes: "t ah m aa t ow".
Challenge: many-to-1 alignment; many consecutive spectrogram frames correspond to a single phoneme label.
How can we predict the sequence of phoneme labels? Sequence labeling and alignment.
Discretization – A Classification Approach
Connectionist Temporal Classification (CTC): at every frame, a softmax produces output activations (a distribution) over the phoneme labels plus an extra symbol ∅ for 'blank' or no label. The most probable sequence of labels is obtained by collapsing repeated labels and removing blanks, so no frame-level alignment is needed as supervision.
Graves et al., Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, ICML 2006
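PyTorch provides a built-in CTC loss that marginalizes over all valid frame alignments; a minimal usage sketch with made-up shapes and labels:

```python
import torch
import torch.nn as nn

T, B, C = 50, 1, 40   # input frames, batch size, classes (blank at index 0)
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.tensor([[7, 3, 12, 5, 7, 9]])   # a phoneme label sequence
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([6])

ctc = nn.CTCLoss(blank=0)   # sums over all alignments that collapse to targets
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```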
Discretization and Representation – Cluster-based Approaches
HuBERT (Hidden-Unit BERT): speech features are discretized into pseudo-labels by k-means clustering; a self-attention Transformer with linear projection layers is then trained by masked prediction of these hidden units.
Hsu et al., HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, arXiv 2021
Contextualized (aligned) representations
Self-Attention
[Figure: a self-attention module over the tokens "I do not like it"; each output is a weighted sum (Σ) over all the inputs.]
Transformer Self-Attention
[Figure: each token of "I do not like it" is projected into query, key, and value vectors; scaled dot-product attention weights combine the value vectors into each output (Σ).]
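A single-head sketch of this computation (the projection matrices are assumed given):

```python
import torch
import torch.nn.functional as F

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (a sketch).

    X: (n, d) token embeddings, e.g., for "I do not like it".
    Wq, Wk, Wv: (d, dk) learned projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    dk = Q.size(-1)
    weights = F.softmax(Q @ K.t() / dk ** 0.5, dim=-1)  # scaled dot-product
    return weights @ V   # each output: weighted sum over all tokens' values
```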
[Figure: the same computation abstracted as a "Transformer's Self-Attention Layer" applied to "I do not like it".]
Transformer Multi-Head Self-Attention
[Figure: several self-attention layers (heads) run in parallel over "I do not like it"; their outputs are concatenated and combined by a linear projection.]
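A short usage sketch with PyTorch's built-in multi-head attention layer (the dimensions are made up):

```python
import torch
import torch.nn as nn

# Multi-head self-attention over one 5-token sentence ("I do not like it").
tokens = torch.randn(1, 5, 64)   # (batch, tokens, embedding dim)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
out, weights = mha(tokens, tokens, tokens)   # query = key = value (self-attention)
print(out.shape, weights.shape)              # (1, 5, 64), (1, 5, 5)
```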
Sequence-to-Sequence Modeling
How can we perform seq2seq translation with transformer attention? For example, translating "I do not like it" into "Je n'aime pas cela".
Seq2Seq with Transformer Attentions
[Figure: self-attention encodes the source "I do not like it"; the decoder must produce "Je n'aime pas cela".]
Seq2Seq with Transformer Attentions
[Figure: the encoder applies self-attention to "I do not like it"; the decoder applies "masked" self-attention over "START Je n' aime pas", where each position may only attend to earlier positions, to predict "Je n'aime pas cela".]
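A sketch of the causal ("masked") variant, which differs from the encoder's self-attention only by a mask on the attention scores:

```python
import torch
import torch.nn.functional as F

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal ("masked") self-attention: position i may only attend to
    positions <= i, so the decoder cannot peek at future target tokens."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, dk = Q.shape
    scores = Q @ K.t() / dk ** 0.5
    future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float('-inf'))   # block future positions
    return F.softmax(scores, dim=-1) @ V
```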
Seq2Seq with Transformer Attentions
[Figure: encoder self-attention over the source, masked self-attention over the shifted target, and a transformer attention block between them with value, key, and query inputs.]
How should we connect the encoder and decoder self-attention to the transformer attention?
Seq2Seq with Transformer Attentions
[Figure: the full architecture – an "encoder" transformer (self-attention over "I do not like it"), a "decoder" transformer (masked self-attention over "START Je n' aime pas"), and an "encoder-decoder" transformer attention in which the decoder states provide the query and the encoder outputs provide the key and value.]
Token-level and Sentence-level Embeddings
[Figure: a transformer over "I do not like it" produces token-level embeddings (one per token) or a single sentence-level embedding, e.g., by pooling or via a dedicated token.]
Pre-Training and Fine-Tuning
Pre-training (e.g., a language model) learns transformer parameters on large unlabeled corpora; fine-tuning initializes from those parameters and adapts the model to a downstream task.
Fine-tuning Approaches
Radford et al. "Improving Language Understanding by Generative Pre-Training." 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Vision Transformer (ViT)
https://arxiv.org/abs/2010.11929
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv (2020).
Vision Transformer (ViT)
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv (2020).
The image is split into 16x16 patches; each patch is flattened and linearly projected into a token embedding, and a dedicated token yields an embedding for the whole image.
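A sketch of the patch-flattening step (the linear projection and the whole-image token are omitted):

```python
import torch

def patchify(images, patch_size=16):
    """Split images into flattened 16x16 patches, ViT-style (a sketch).

    images: (B, C, H, W) with H and W divisible by patch_size.
    Returns: (B, num_patches, C * patch_size**2)
    """
    B, C, H, W = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)  # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5)       # group by patch location
    return patches.reshape(B, (H // p) * (W // p), C * p * p)

x = torch.randn(2, 3, 224, 224)
print(patchify(x).shape)   # (2, 196, 768): 14x14 patches, each 3*16*16 values
```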
Masked Auto-Encoder (MAE)
He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022
Mask a random subset of the image patches (~75% in the paper); a Vision Transformer (ViT) encodes only the visible patches, and a transformer decoder, used only during pre-training, reconstructs the image via a reconstruction loss.
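A sketch of the random masking step, following the shuffle-and-keep trick common in MAE implementations (names are illustrative):

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """Randomly keep a subset of patch tokens for the encoder (MAE-style).

    patch_tokens: (B, N, D). Returns the visible tokens and the random
    permutation, from which the masked positions can be reconstructed.
    """
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)               # one random score per patch
    ids_shuffle = noise.argsort(dim=1)     # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle            # encoder sees only `visible`
```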
Masked Auto-Encoder (MAE)
He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022
Multimodal Transformers
Multimodal Embeddings
Example: language ("I really liked it this time"), visual, and acoustic streams.
How to learn contextualized representations from multiple modalities?
Option 1: Concatenate the modalities into one sequence and learn a BERT-style transformer over it.
Simple Solution: Contextualized Multimodal Embeddings
[Figure: language, visual, and acoustic features for "I really liked it this time" concatenated into one sequence and processed by transformer self-attention.]
VisualBERT
https://arxiv.org/abs/1908.03557
Li, Liunian Harold, et al. "Visualbert: A simple and performant baseline for vision and language." arXiv (2019).
UNITER
https://arxiv.org/abs/1909.11740v1
Chen, Yen-Chun, et al. "Uniter: Universal image-text representation learning." European conference on computer vision. 2020.
Similar Transformer architecture to BERT and VisualBERT, but with slightly different optimization:
- masking words
- masking image regions
- image-text matching: does the sentence match the image?
Visual-and-Language Transformer (ViLT)
Kim et al. "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision." ICML 2021. https://arxiv.org/abs/2102.03334
Uses optimal transport (word-patch alignment) as a global alignment objective.
Visual-and-Language Transformer (ViLT)
Example of alignment between modalities:
https://arxiv.org/abs/2102.03334
Multimodal Embeddings
Example: language ("I really liked it this time"), visual, and acoustic streams.
How to learn contextualized representations from multiple modalities?
Option 2: Look at pairwise interactions between modalities.
Multimodal Transformer – Pairwise Cross-Modal
[Figure: unimodal representations of the visual, vocal, and verbal ("I like…") streams pass through cross-modal attention blocks (x N layers), giving cross-modally contextualized unimodal representations that are combined into a multimodal representation.]
Visually contextualizing the verbal modality: each word (e.g., "spectacle" in "I like…") computes similarities (attentions) to the visual features across time.
The attentions pull in correlated visual information, a weighted sum of the visual embeddings, which is added to the word through a residual connection, yielding a new visually-contextualized representation of language.
Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019
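A single-block sketch of this cross-modal attention, in the spirit of the Multimodal Transformer (names are illustrative):

```python
import torch
import torch.nn.functional as F

def crossmodal_attention(verbal, visual, Wq, Wk, Wv):
    """Cross-modal attention block sketch: the verbal sequence queries the
    visual sequence, and each word is updated with correlated visual
    information through a residual connection.

    verbal: (n_words, d), visual: (n_frames, d); Wq, Wk, Wv: (d, d)
    """
    Q = verbal @ Wq                     # queries come from the verbal modality
    K, V = visual @ Wk, visual @ Wv     # keys and values from the visual modality
    attn = F.softmax(Q @ K.t() / Q.size(-1) ** 0.5, dim=-1)
    return verbal + attn @ V            # residual: visually-contextualized words
```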
ViLBERT
Lu, Jiasen, et al. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." arXiv (August 6, 2019).
[Figure: each modality first passes through a unimodal transformer, then through cross-modal transformer modules (co-attention between streams).]
LXMERT
https://arxiv.org/abs/1908.07490v1
Tan, Hao, and Mohit Bansal. "Lxmert: Learning cross-modality encoder representations from transformers." arXiv (August 20, 2019).
[Figure: unimodal encoders followed by cross-modal transformer modules.] Why this design?
Video-based Representation and Alignment
https://www.di.ens.fr/willow/research/howto100m/
HowTo100M benchmark dataset
Visual Representations from Uncurated Instructional Videos
Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. "End-to-End Learning of Visual Representations from Uncurated Instructional Videos." CVPR 2020.
Goal: Learn better visual representations… by taking advantage of large-scale video+language resources.
Instructional videos provide weakly-paired data: narrations such as "it's turning into a much thicker mixture" or "The biggest mistake is not kneading it enough" are only loosely aligned with the frames.
Another Approach for Weakly-Paired Video Data
https://arxiv.org/abs/1904.01766
How do we get visual words now? k-means clustering of video features; each cluster centroid serves as a discrete visual token.
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid. "VideoBERT: A Joint Model for Video and Language Representation Learning." ICCV 2019.
ActBERT
Zhu and Yang, ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020
ALBEF: Align Before Fusion
Contrastive alignment of segment-level image and text embeddings (related to mutual information maximization) brings the two modalities into close alignment before fusion.
Li et al., Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, NeurIPS 2021
Conditioning language models
Autoregressive Models
Autoregressive language models: model $p(x) = \prod_t p(x_t \mid x_{<t})$, generating one token at a time.
[Brown et al., Language Models are Few-shot Learners. NeurIPS 2020]
Autoregressive Models
Autoregressive audio generation models
[van den Oord et al., WaveNet: A Generative Model for Raw Audio. ICML 2016]
Conditioning Autoregressive Models
We typically want p(x|c) - conditional generation
We might also care about p(x2|x1,c) - style transfer
Conditioning Autoregressive Models
[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]
Conditioning via prefix tuning – modeling p(x|c):
[Figure: a pretrained LM models p(x) over "A small red boat on the water."; adding a trainable adapter that maps the image to prefix tokens yields an adapted + pretrained model for p(x|c) over the same caption, conditioned on the image.]
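A sketch of the prefix-conditioning idea (module names and sizes are illustrative; the language model itself stays frozen):

```python
import torch
import torch.nn as nn

class VisualPrefixLM(nn.Module):
    """Prefix-conditioning sketch: a trainable adapter maps image features
    to a few "prefix" embeddings prepended to the frozen LM's token
    embeddings, turning p(x) into p(x|c)."""

    def __init__(self, lm_dim=768, n_prefix=2, img_dim=2048):
        super().__init__()
        self.adapter = nn.Linear(img_dim, n_prefix * lm_dim)  # the only trained part
        self.n_prefix, self.d = n_prefix, lm_dim

    def forward(self, img_features, token_embeddings):
        # img_features: (B, img_dim); token_embeddings: (B, T, d)
        prefix = self.adapter(img_features).view(-1, self.n_prefix, self.d)
        return torch.cat([prefix, token_embeddings], dim=1)  # fed to the frozen LM
```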
Conditioning Autoregressive Models
[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]
Conditioning via prefix tuning – 0-shot VQA:
[Figure: the image prefix from the adapter, followed by the question "What color is the car?", is fed to the pretrained LM, which generates "Blue"; the adapter turns the frozen model's p(x) into p(x|c).]
Conditioning Autoregressive Models
[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]
Conditioning via prefix tuning – 1-shot outside-knowledge VQA:
[Figure: one in-context example (image prefix + text) followed by the query image prefix; the adapted + pretrained LM answers "Steve Jobs".]
Recall reasoning – leverage implicit knowledge in LMs.
Conditioning Autoregressive Models
[Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021]
Conditioning via prefix tuning – few-shot image classification:
[Figure: a few (image prefix, caption) pairs such as "This is a dax." are provided in context; the adapted + pretrained LM then names a new image.]
Conditioning Autoregressive Models
[Zhu et al., MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models. 2023]
Conditioning via prefix tuning
MiniGPT-4
Stage 1: Alignment using paired image-text data.
Stage 2: Instruction tuning using image + text instructions and example completions.
Reminder: Modality-Shifting Fusion
[Figure: secondary modalities pass through a gate that computes a shift applied to the primary modality's representation.]
Example with the language modality – primary modality: language; secondary modalities: acoustic and visual. The word "expectations" can be shifted toward a negative or a positive representation depending on the accompanying nonverbal behavior.
Wang et al., Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors, AAAI 2019
Modality-Shifting with Transformers
Multimodal Adaptation Gate (MAG) + BERT: a MAG module is inserted between transformer self-attention layers, shifting word representations using the nonverbal modalities.
Rahman et al., Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020
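A sketch of such a gated shift module (layer names and dimensions are illustrative, not the exact MAG formulation):

```python
import torch
import torch.nn as nn

class ModalityShiftGate(nn.Module):
    """Gated modality shift sketch: nonverbal features produce a gated
    displacement added to each word representation."""

    def __init__(self, d_lang=768, d_nonverbal=74):
        super().__init__()
        self.gate = nn.Linear(d_lang + d_nonverbal, 1)   # how much to shift
        self.shift = nn.Linear(d_nonverbal, d_lang)      # direction of the shift

    def forward(self, words, nonverbal):
        # words: (T, d_lang); nonverbal: (T, d_nonverbal), time-aligned
        g = torch.sigmoid(self.gate(torch.cat([words, nonverbal], dim=-1)))
        return words + g * self.shift(nonverbal)         # shifted representations
```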
Recap