1 of 22

Visio-Linguistic Brain Encoding

Subba Reddy Oota1,2, Jashn Arora2, Vijay Rowtula2, Manish Gupta2,3, Bapi Raju Surampudi2

August 13, 2022

COLING 2022

1Inria Bordeaux France, 2IIIT-Hyderabad, 3Microsoft India

2 of 22

What is fMRI?

https://www.biopac.com/events/fmri-psych/

A vision-language task in the scanner

Concept + Picture

fMRI Brain Activity

3 of 22

Brain Encoding vs Decoding

Haiguang Wen et al. 2017

Encoding: Stimulus → Representation → fMRI

Decoding: fMRI → Representation → Stimulus

Example stimulus: Concept + Picture

4 of 22

What is Brain Encoding?

Schrimpf et al. 2021 fMRI

Present

Stimulus

5 of 22

What is Brain Encoding?

Schrimpf et al. 2021 fMRI

Present

Stimulus

Present

Stimulus

6 of 22

What is Brain Encoding?

Schrimpf et al. 2021

Present

Stimulus

Stimulus

Ridge Regression

Present

Input (X)

Output (Y)

Weights (W)

Pearson Correlation (R) = Corr(Y, W(X))
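The ridge-regression encoding step and its Pearson score can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the helper names `fit_ridge` and `pearson_per_voxel`, the array sizes, the noise level, and the regularization strength `alpha=1.0` are all assumptions made for the example. X stands in for stimulus representations and Y for voxel responses.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge solution W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def pearson_per_voxel(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, one value per voxel (column)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins (assumed sizes): X = stimulus features, Y = voxel responses.
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 200, 50, 16, 10
W_true = rng.normal(size=(n_feat, n_vox))
X_train = rng.normal(size=(n_train, n_feat))
X_test = rng.normal(size=(n_test, n_feat))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_vox))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, n_vox))

# Fit on training stimuli, then score predictions on held-out stimuli.
W = fit_ridge(X_train, Y_train, alpha=1.0)
r = pearson_per_voxel(Y_test, X_test @ W)
mean_r = float(r.mean())
```

In practice `alpha` would typically be chosen by cross-validation, and the score would be averaged over voxels within a region of interest; neither detail is specified on the slide.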

7 of 22

Most popular models are Transformers

Transformer language models

Vaswani et al. 2017, Dosovitskiy et al. 2021, Harold Li et al. 2019

Vision Transformer (ViT)

Multi-modal Transformer

8 of 22

Brain encoding for single-mode stimuli: Vision

Schrimpf et al. 2019, Wang et al. 2019

9 of 22

Brain encoding for single-mode stimuli: Text

Vaswani et al. 2017, Gauthier et al. 2019, Schrimpf et al. 2021

Transformer language models (BERT, XLM, GPT, …)

10 of 22

Can image-based and multi-modal Transformers accurately perform fMRI encoding?

Dosovitskiy et al. 2021, Tan et al. 2019, Harold Li et al. 2019

11 of 22

Models used: Multi-Modal Transformers

CLIP

LXMERT

VisualBERT

Radford et al. 2021, Tan et al. 2019, Harold Li et al. 2019

12 of 22

Models used: Image Transformers

ViT

BEiT

DeiT

Dosovitskiy et al. 2021, Bao et al. 2021, Touvron et al. 2021

13 of 22

Models used: CNNs

VGGNet

ResNet50

InceptionV2

EfficientNet

Simonyan et al. 2014, He et al. 2016, Szegedy et al. 2017, Tan et al. 2019

14 of 22

Dataset Details

Pereira

Pereira et al. 2018, Chang et al. 2019

BOLD5000

Concept+Picture (Bird)

15 of 22

Evaluation Metrics: 2V2 and Pearson

2V2 Accuracy

Cosine distance

Toneva et al. 2020
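The 2V2 accuracy with cosine distance can be sketched as below, assuming the usual pairwise formulation: for every pair of test items (i, j), the prediction counts as correct when the summed distance of the matched pairing (true i with predicted i, true j with predicted j) is smaller than that of the swapped pairing. The function names and the toy data are assumptions for illustration, not the authors' code.

```python
import numpy as np

def cosine_distance_matrix(A, B):
    """D[i, j] = 1 - cosine similarity between row i of A and row j of B."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A_n @ B_n.T

def two_vs_two_accuracy(Y_true, Y_pred):
    """Fraction of item pairs whose matched pairing beats the swapped pairing."""
    D = cosine_distance_matrix(Y_true, Y_pred)
    i, j = np.triu_indices(D.shape[0], k=1)  # all unordered pairs with i < j
    matched = D[i, i] + D[j, j]
    swapped = D[i, j] + D[j, i]
    return float((matched < swapped).mean())

# Toy check: perfect predictions versus predictions assigned to the wrong items.
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 8))
acc_perfect = two_vs_two_accuracy(Y, Y)
acc_swapped = two_vs_two_accuracy(Y[:2], Y[:2][::-1])
```

Chance level for this metric is 0.5, since a random predictor favors the matched and swapped pairings equally often.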

16 of 22

Encoding performance (BOLD5000)

17 of 22

Encoding performance (Pereira)

18 of 22

Model size vs Efficacy

Pereira

BOLD5000

19 of 22

Single Stream vs Dual Stream

Dual Stream

Single Stream

20 of 22

Is Linguistic Information Important in Multi-Modal Transformers?

Randomized Image-Text pairs

Correct Image-Text pairs
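One way such a control condition could be built is sketched below: each image is re-paired with a text drawn from a different item, so that no image keeps its own text. The slide does not say how the pairs were randomized, so the cyclic-shift scheme, the function name `randomize_pairs`, and the file names and captions are all hypothetical.

```python
import numpy as np

def randomize_pairs(images, texts, seed=0):
    """Re-pair every image with the text of a different item (requires >= 2 items)."""
    n = len(texts)
    if n < 2:
        raise ValueError("need at least two items to mismatch pairs")
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    shuffled = [None] * n
    for k in range(n):
        # The item at position k in the random order receives the text of the
        # next item in that order; indices in a permutation are distinct, so
        # no item ever receives its own text.
        shuffled[order[k]] = texts[order[(k + 1) % n]]
    return list(zip(images, shuffled))

# Hypothetical stimuli, for illustration only.
images = ["bird.jpg", "chair.jpg", "apple.jpg", "car.jpg"]
texts = ["bird", "chair", "apple", "car"]
correct_pairs = list(zip(images, texts))
random_pairs = randomize_pairs(images, texts, seed=0)
```

Features extracted from `random_pairs` can then be compared with features from `correct_pairs` under the same encoding model to see how much the matching text contributes.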

21 of 22

Does Language Influence Vision?

22 of 22

Collaborators

Subba Reddy Oota

Jashn Arora

Vijay Rowtula

Manish Gupta

Bapi Raju Surampudi