Visio-Linguistic Brain Encoding
Subba Reddy Oota1,2, Jashn Arora2, Vijay Rowtula2, Manish Gupta2,3, Bapi Raju Surampudi2
August 13, 2022
COLING 2022
1Inria Bordeaux France, 2IIIT-Hyderabad, 3Microsoft India
What is fMRI?
https://www.biopac.com/events/fmri-psych/
A vision-language task in the scanner
Concept + Picture
fMRI Brain Activity
Brain Encoding vs Decoding
Encoding: Stimulus (Concept + Picture) → Representation → fMRI
Decoding: fMRI → Representation → Stimulus (Concept + Picture)
What is Brain Encoding?
Schrimpf et al. 2021 fMRI
Present
Stimulus
Ridge Regression
Input: X (stimulus representation)
Output: Y (fMRI)
Weights: W
Pearson Correlation (R) = Corr(Y, W(X))
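The encoding step above (ridge regression from stimulus features X to fMRI responses Y, scored by Pearson correlation) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the function names `fit_ridge` and `pearson_per_voxel`, the regularization strength `alpha=1.0`, the array sizes, and the train/test split are all assumptions made for the example.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge solution W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def pearson_per_voxel(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, one value per voxel (column)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins (assumption): 200 stimuli, 16 model features, 5 voxels.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
W_true = rng.standard_normal((16, 5))
Y = X @ W_true + 0.1 * rng.standard_normal((200, 5))

# Fit on the first 150 stimuli, evaluate on the held-out 50.
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]
W = fit_ridge(X_train, Y_train, alpha=1.0)
r = pearson_per_voxel(Y_test, X_test @ W)
mean_r = float(r.mean())
```

In practice `alpha` would typically be chosen by cross-validation, and X would hold features extracted from a pretrained model rather than random numbers.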
Most popular models are Transformers
Transformer language models
Vaswani et al. 2017, Dosovitskiy et al. 2021, Harold Li et al. 2019
Vision Transformer (ViT)
Multi-modal Transformer
Brain encoding for single-mode stimuli: Vision
Schrimpf et al. 2019, Wang et al. 2019
Brain encoding for single-mode stimuli: Text
Transformer language models (BERT, XLM, GPT, …)
Vaswani et al. 2017, Gauthier et al. 2019, Schrimpf et al. 2021
Can image-based and multi-modal Transformers accurately perform fMRI encoding?
Dosovitskiy et al. 2021, Tan et al. 2019, Harold Li et al. 2019
Models used: Multi-Modal Transformers
CLIP
LXMERT
VisualBERT
Radford et al. 2021, Tan et al. 2019, Harold Li et al. 2019
Models used: Image Transformers
ViT
BEiT
DEiT
Dosovitskiy et al. 2021, Bao et al. 2021, Touvron et al. 2021
Models used: CNNs
VGGNet
ResNet50
InceptionV2
EfficientNet
Simonyan et al. 2014, He et al. 2016, Szegedy et al. 2017, Tan et al. 2019
Dataset Details
Pereira
Pereira et al. 2018, Chang et al. 2019
BOLD5000
Concept+Picture (Bird)
Evaluation Metrics: 2V2 and Pearson
2V2 Accuracy
Cosine distance
Toneva et al. 2020
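The 2V2 accuracy metric with cosine distance can be sketched as below. For every pair of test stimuli (i, j), the prediction counts as correct when the matched assignment (true i with predicted i, true j with predicted j) has a smaller summed cosine distance than the swapped assignment. The function names and the synthetic arrays are assumptions for illustration; this is a sketch of the standard formulation, not the authors' evaluation code.

```python
from itertools import combinations

import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two 1-D vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_vs_two_accuracy(Y_true, Y_pred):
    """Fraction of stimulus pairs where the matched assignment beats the swapped one."""
    n = Y_true.shape[0]
    correct = 0
    total = 0
    for i, j in combinations(range(n), 2):
        matched = (cosine_distance(Y_true[i], Y_pred[i])
                   + cosine_distance(Y_true[j], Y_pred[j]))
        swapped = (cosine_distance(Y_true[i], Y_pred[j])
                   + cosine_distance(Y_true[j], Y_pred[i]))
        correct += int(matched < swapped)
        total += 1
    return correct / total

# Synthetic stand-ins (assumption): 20 test stimuli, 50 voxels.
rng = np.random.default_rng(0)
Y_true = rng.standard_normal((20, 50))
Y_good = Y_true + 0.01 * rng.standard_normal((20, 50))  # near-perfect predictions
Y_rand = rng.standard_normal((20, 50))                  # unrelated predictions

acc_good = two_vs_two_accuracy(Y_true, Y_good)
acc_rand = two_vs_two_accuracy(Y_true, Y_rand)
```

Unrelated predictions are expected to score near chance (about 0.5), which is why 2V2 accuracy is usually reported against that baseline.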
Encoding performance (BOLD5000)
Encoding performance (Pereira)
Model size vs Efficacy
Pereira
BOLD5000
Single Stream vs Dual Stream
Dual Stream
Single Stream
Is Linguistic Information Important in Multi-Modal Transformers?
Randomize Image-Text pairs
Correct Image-Text pairs
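One way to build the randomized image-text condition is to reassign captions so that no image keeps its own text. The sketch below does this with a shuffled cyclic shift. The slides do not state how the pairs were randomized, so the procedure, the function name `randomize_pairs`, and the file names and concept words are all hypothetical.

```python
import random

def randomize_pairs(images, texts, seed=0):
    """Return (image, text) pairs in which every image gets another image's text."""
    if len(images) != len(texts):
        raise ValueError("images and texts must have the same length")
    n = len(images)
    if n < 2:
        raise ValueError("need at least two pairs to mismatch them")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    # Walk the shuffled order as a cycle: each item takes the text of the next one,
    # so no item can be mapped back to its own text.
    text_index = [0] * n
    for k, idx in enumerate(order):
        text_index[idx] = order[(k + 1) % n]
    return [(images[i], texts[text_index[i]]) for i in range(n)]

# Hypothetical stimuli for illustration only.
images = ["bird.jpg", "car.jpg", "apple.jpg", "house.jpg"]
texts = ["bird", "car", "apple", "house"]
correct_pairs = list(zip(images, texts))
random_pairs = randomize_pairs(images, texts, seed=0)
```

Comparing encoding performance on features from `correct_pairs` against features from `random_pairs` is then the contrast the slide describes.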
Does Language Influence Vision?
Collaborators
Subba Reddy Oota
Jashn Arora
Vijay Rowtula
Manish Gupta
Bapi Raju Surampudi