1 of 41

Visual Commonsense Generation & its incorporation into a Multimodal Topic Modeling algorithm

Aditya Chinchure, Sahithya Ravi, Felipe González-Pizarro

CPSC 532S - Final Presentation

2 of 41

Motivating Example:

Are these images from the same topic?

Commonsense reasoning may be useful in many vision & language tasks, e.g. topic models.


3 of 41

Our contributions:

  1. VisualCOMET+: A model that generates commonsense inferences from an image plus textual cues.
    • Motivation: Useful for many multimodal downstream applications such as topic modeling, summarization, story understanding, and dialog.
    • We are the first to investigate commonsense generation over object and event relations simultaneously.
  2. Multimodal CTM: We propose a multimodal topic modeling algorithm that takes texts, images, and VisualCOMET+ inferences as input.
    • Hypothesis: Considering images and VisualCOMET+ inferences yields a better representation of documents.
    • This is the first multimodal neural topic modeling algorithm.


4 of 41

Architectural Overview:

[Architecture diagram: an image with the textual cue "Whale watching tour…" is passed to VisualCOMET+ (commonsense generation), which produces inferences such as "… used for sailing" and "… capable of fun"; the image, text, and inferences feed the Multimodal CTM, which outputs topics such as Topic 1: boat, water, dock, floating; Topic 2: ocean, wet, surf, waves; Topic 3: people, group, gathered, around.]

5 of 41

Part-I: VisualCOMET+ for commonsense generation over images and text.


6 of 41

Related Work: VisualCOMET


VisualCOMET: a VL transformer trained on 60K images

VisualCOMET: https://visualcomet.xyz (Park et al. ECCV 2020)

Relations:

  1. Before
  2. Because
  3. After

7 of 41

VisualCOMET+ as an extension to VisualCOMET


Relations:

  • Before
  • Because
  • After
  • HasProperty
  • Indicates
  • HasContext
  • AtLocation

[Example diagram (image of a chef holding a knife): knife <HasProperty> made of metal, used for cutting meat; knife <HasContext> cutting, killing; knife <AtLocation> kitchen, cutting board; Chef holding knife <Indicates> he is cooking, he is about to cut it.]

We support more relations!

8 of 41

VisualCOMET+ Model

We fine-tune a pre-trained VL transformer decoder for sentence completion conditioned on a relation.

[Model diagram: the Visual-Language Transformer Decoder receives the image feature, ROI features with their <ROI ID>s, the head entity (e.g., "Boat"), the visual context, and a relation token (e.g., <hascontext>), and autoregressively generates the tail entity (e.g., "sailing in ocean <END>").]
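As a rough illustration of how the textual side of this input sequence can be assembled, here is a minimal sketch; the tokenizer choice, special-token names, and field order are our own simplification, not the released VisualCOMET code (image and ROI features are injected as dense embeddings inside the model and are omitted here).

```python
# Sketch of assembling the textual input for the VisualCOMET+ decoder.
# Token names, tokenizer choice, and field order are illustrative assumptions.
from transformers import GPT2Tokenizer

RELATION_TOKENS = ["<before>", "<because>", "<after>",
                   "<hasproperty>", "<indicates>", "<hascontext>", "<atlocation>"]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": RELATION_TOKENS + ["<END>"]})

def build_text_input(visual_context, head_entity, relation, tail_entity=None):
    """Visual context + head entity + relation, optionally followed by the
    teacher-forced tail entity during training."""
    assert relation in RELATION_TOKENS
    text = f"{visual_context} {head_entity} {relation}"
    if tail_entity is not None:
        text = f"{text} {tail_entity} <END>"
    return tokenizer(text, return_tensors="pt")

# e.g. build_text_input("Whale watching tour", "Boat", "<hascontext>", "sailing in ocean")
```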


10 of 41

Datasets: Sherlock

Clue (observed from the image, marked with a bounding box) → Rationale (what does the clue indicate?):

  • warm coats being worn → It is cold outside.
  • Horses wearing saddles and reins. → Horses are used for riding.
  • snow on the ground and in the trees → It snowed recently.

363K clue-rationale pairs

11 of 41

Clue Rationales ⇾ ConceptNet Paths

[Diagram: Clue: "Horses wearing saddles and reins." Rationale: "Horses are used for horseback riding." The clue <indicates> the rationale (a non-ConceptNet relation). Starting from the clue and rationale, we match their mentions to ConceptNet nodes (e.g., saddles, riding, race track, accessory for horseback riding) and run shortest-path finding over ConceptNet relations such as <usedfor>, <atlocation>, <hascontext>, <hasproperty>.]
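A minimal sketch of the node-matching and shortest-path step, assuming ConceptNet edges are already available as (head, relation, tail) string triples; the substring-based node matching below is a simplification of what we actually do.

```python
# Sketch: link a Sherlock clue to its rationale through ConceptNet.
# `edges` is assumed to be an iterable of (head, relation, tail) strings
# extracted from a ConceptNet dump.
import networkx as nx

def build_graph(edges):
    g = nx.Graph()                      # undirected, to make path finding permissive
    for head, rel, tail in edges:
        g.add_edge(head, tail, relation=rel)
    return g

def match_nodes(sentence, graph):
    """Very rough node matching: graph nodes whose label appears in the sentence."""
    sentence = sentence.lower()
    return [n for n in graph.nodes if n.lower() in sentence]

def shortest_paths(clue, rationale, graph):
    paths = []
    for src in match_nodes(clue, graph):
        for dst in match_nodes(rationale, graph):
            try:
                paths.append(nx.shortest_path(graph, src, dst))
            except nx.NetworkXNoPath:
                continue
    return sorted(paths, key=len)       # prefer the shortest connections

# shortest_paths("Horses wearing saddles and reins.",
#                "Horses are used for horseback riding", graph)
```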


13 of 41

Clue: Horses wearing saddles and reins.

Rationale: Horses are used for riding.

14 of 41

ConceptNet Paths ⇾ Triplets

Each ConceptNet path is flattened into (head entity, relation, tail entity) triples, each paired with the visual context (the image region of the clue):

Head Entity | <relation> | Tail Entity
Horses wearing saddles and reins. | <indicates> | Horses are used for riding.
Saddles | <usedfor> | riding
Saddle | <hasproperty> | accessory for riding
… | … | …

50K triples, with 80% for training, 10% for test, and 10% for validation.
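A sketch of how a found path can be flattened into training triples and split 80/10/10; the field names and the graph object follow the previous sketch and are illustrative only.

```python
# Sketch: turn one ConceptNet path into (head, relation, tail) triples and
# split the collected triples into train/test/validation.
import random

def path_to_triples(path_nodes, graph, visual_context):
    """Each consecutive node pair on the path becomes one training triple."""
    triples = []
    for head, tail in zip(path_nodes, path_nodes[1:]):
        rel = graph.edges[head, tail]["relation"]
        triples.append({"visual_context": visual_context,   # image region of the clue
                        "head": head, "relation": rel, "tail": tail})
    return triples

def train_test_val_split(triples, seed=0):
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    n = len(triples)
    return (triples[: int(0.8 * n)],                 # 80% train
            triples[int(0.8 * n): int(0.9 * n)],     # 10% test
            triples[int(0.9 * n):])                  # 10% validation
```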

15 of 41

VisualCOMET+ Results

Model* | Test Set | Relations | BLEU-2 | METEOR
VisualCOMET+ | Sherlock | indicates, hasproperty, atlocation, hascontext | 0.16 | 0.18

METEOR matches VisualCOMET on their test set (0.18).

[Example (image region: "Table full of drinks"): generated inferences for <indicates>, <hascontext>, <hasproperty>, and <atlocation> include "People are drinking alcohol", "A celebration taking place.", "This is in a bar.", and "bar".]

* Trained with the same hyperparameters as VisualCOMET: https://visualcomet.xyz (Park et al., ECCV 2020). See more ablations in the report.


17 of 41

[Example 1 (image region: "a watch on the wrist of a person"): generations across <indicates>, <hascontext>, <hasproperty>, <atlocation> include "The person wearing this watch cares about being on time.", "computing", "is used for keeping track of time", "time", and "wristwatch".]

[Example 2 (image region: "a person surrounded by people pointing guns at them"): generations across <indicates>, <intent>, <hasproperty>, <after> include "the person is a criminal and the people are threatening them", "they are a criminal", "the person is a criminal and they are doing something wrong", and "the person is being held as hostage".]

18 of 41

[Example (image region: "paper bag and plastic bags on table"): generations across <atlocation>, <intent>, <hasproperty>, <indicates>, <before>, <after>, <hascontext> include "kitchen", "someone left things in this bag after eating", "is used for dishes", "The person who owns this apartment is a messy person.", and "someone left the food and the items here".]

19 of 41

Part-II: Incorporating VisualCOMET+ in a downstream task (i.e., topic modeling)

We explored whether incorporating image features and VisualCOMET+ inferences helps us obtain a better representation of documents and identify more coherent and diverse topics.

20 of 41

Contextualized Topic Model (CTM)

  • CTM is based on a variational autoencoder.

  • Input: the SBERT representation of each document.

  • The encoder samples the document-topic representation (the hidden representation) from the learned parameters of the distribution.

  • The top words of a topic are obtained from the weight matrix that reconstructs the BOW.

Contextualized embeddings (e.g., SBERT)
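For reference, a minimal text-only CTM run with the contextualized-topic-models package looks roughly like this; argument names may differ across package versions, and the SBERT checkpoint and toy documents are only examples.

```python
# Sketch of training a text-only CTM (ZeroShotTM variant) with SBERT embeddings.
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

raw_docs = ["a lovely bride and her groom at their wedding reception",
            "people gathered around a table full of drinks at the bar"]
bow_docs = raw_docs   # in practice: lowercased, stopword-filtered versions

prep = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")  # SBERT model
dataset = prep.fit(text_for_contextual=raw_docs, text_for_bow=bow_docs)

ctm = ZeroShotTM(bow_size=len(prep.vocab), contextual_size=768, n_components=25)
ctm.fit(dataset)

print(ctm.get_topic_lists(10))   # top-10 words per topic
```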

21 of 41

Contribution: Multimodal Contextualized Topic Model


[1] Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, and Elisabetta Fersini. 2021. Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1676–1683, Online. Association for Computational Linguistics

[Diagram: the baseline CTM represents each document by its text alone, e.g. "A lovely bride and her groom at their wedding reception".]

22 of 41

Contribution: Multimodal Contextualized Topic Model


[Diagram: each document is now represented by its text ("A lovely bride and her groom at their wedding reception"), its image, and VisualCOMET+ inferences (e.g., Hascontext: "The couple married in India"); all three are encoded with contextualized embeddings (e.g., CLIP) and fed to the CTM.]
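A sketch of how the multimodal document representation can be built with CLIP before it is handed to the CTM; the checkpoint name and the simple concatenation scheme are assumptions for illustration.

```python
# Sketch: one multimodal embedding per document by concatenating CLIP encodings
# of the text, the image, and the joined VisualCOMET+ inferences.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")      # encodes both images and text

def embed_document(text, image_path, inferences):
    text_emb = clip.encode(text)
    image_emb = clip.encode(Image.open(image_path))
    inference_emb = clip.encode(" ".join(inferences))
    return np.concatenate([text_emb, image_emb, inference_emb])

# The concatenated vectors replace SBERT as the CTM's contextualized input,
# so contextual_size becomes 3 * 512 for ViT-B/32.
```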

23 of 41

Dataset: Visual Storytelling (VIST)


[1] Huang, T. H., Ferraro, F., Mostafazadeh, N., Misra, I., Agrawal, A., Devlin, J., ... & Mitchell, M. (2016, June). Visual storytelling. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 1233-1239).

  • 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language.
  • We use DII (descriptions-in-isolation) for generating commonsense inferences and SIS (stories-in-sequence) for topic modeling.
  • This dataset is a proxy for social media data, where images are paired with text, like SIS, that adds information beyond the image itself.
  • Source: http://www.visionandlanguage.net/VIST/

24 of 41

Generating Inferences for VIST

  1. Sample 17,000 instances of image-caption-story pairs from the dataset.
  2. Use the image's caption for generating commonsense:
    1. Feed the image feature of the full image (one bounding box containing the whole image) and the image's caption as the textual cue.
    2. Generate for the four new relations we added ("hascontext", "atlocation", "hasproperty", "indicates") → obtain one inference from each and join the inference sentences (see the sketch below).

[Example: Caption (DII attribute): "Many colorful, cut flowers adorn the front of the shop." Generated inferences across <hascontext>, <hasproperty>, <indicates> include "This is a flower shop, located in the heart of the city.", "This shop has a wide variety of flowers available for sale", and "The owner of the shop has a really great sense of style".]
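The generation loop itself is straightforward; in the sketch below, generate_tail is a hypothetical wrapper around a decode call to the trained VisualCOMET+ model.

```python
# Sketch of the VIST inference-generation step: one region covering the whole
# image plus the DII caption go in, one inference per relation comes out.
RELATIONS = ["<hascontext>", "<atlocation>", "<hasproperty>", "<indicates>"]

def full_image_box(width, height):
    # A single bounding box spanning the entire image.
    return [0, 0, width, height]

def inferences_for_image(image_feature, caption, generate_tail):
    sentences = []
    for relation in RELATIONS:
        tail = generate_tail(image_feature, caption, relation)  # greedy/beam decode
        sentences.append(tail)
    return " ".join(sentences)   # joined and later embedded for the topic model
```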

25 of 41

Topic Modeling: Images and Inferences improve topic model quality

Comparison of topic coherence and topic diversity across document representations. Each result is averaged over 11 runs; all metrics are computed for 25 topics.

We also evaluated other hyperparameters, such as different numbers of topics, and other contextualized embeddings (i.e., SBERT); these results will be included in the report.

26 of 41

Insights: Multimodal CTM Visualization


[1] Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).

We used an interactive topic-modeling visualization tool (LDAvis [1]) to interpret topics and analyze the quality of our intermediate results.
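pyLDAvis is one implementation of LDAvis; feeding it the trained model's distributions looks roughly like this, assuming the topic-term and document-topic matrices have already been extracted from the CTM (random toy stand-ins are used below so the snippet runs on its own).

```python
# Sketch: build the interactive LDAvis view from a topic model's outputs.
import numpy as np
import pyLDAvis

rng = np.random.default_rng(0)
topic_term_dists = rng.dirichlet(np.ones(200), size=25)   # toy stand-in: 25 topics x 200 words
doc_topic_dists = rng.dirichlet(np.ones(25), size=100)    # toy stand-in: 100 docs x 25 topics
doc_lengths = rng.integers(20, 100, size=100)             # tokens per document
vocab = [f"word{i}" for i in range(200)]
term_frequency = rng.integers(1, 50, size=200)             # corpus-wide counts

vis = pyLDAvis.prepare(topic_term_dists, doc_topic_dists,
                       doc_lengths, vocab, term_frequency)
pyLDAvis.save_html(vis, "topics.html")   # open in a browser to explore topics
```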

27 of 41

Insights: Multimodal CTM


Topic #1: ['couple', 'wedding', 'bride', 'reception', 'wife', 'husband', 'love', 'two', 'guests', 'groom']

Topic #2: ['speech', 'gave', 'working', 'talked', 'speaker', 'audience', 'class', 'hard', 'work', 'american']

Topic #3: ['little', 'kids', 'cake', 'birthday', 'girl', 'brought', 'grandma', 'candles', 'boy', 'excited']

28 of 41

Discussion

  1. VisualCOMET+:
    1. Automatic metrics match VisualCOMET.
    2. Manual analysis reveals sensible inferences for most examples.
  2. Multimodal CTM:
    • Images & inferences can result in more coherent and diverse topics.
    • Analysis reveals useful topics!
    • This is the first multimodal neural topic model.
  3. We successfully integrated VisualCOMET+ with Multimodal CTM.

Limitations:

  • VisualCOMET+:
    • Inference diversity is limited ⇾ more diversity in the training data is needed.
    • Inferences are sometimes not faithful to the relation → more data/processing for such relations is needed.
  • Multimodal CTM:
    • The VAE decoder does not reconstruct image features; adding this might boost performance.


Goals for next week: i) re-run our pipeline end-to-end with more and improved data from Sherlock to report our final results; ii) analyze the sources of error.

29 of 41

Thank You!


30 of 41

Related work (VisualCOMET related)

  1. KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation
  2. Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
  3. Stating the Obvious: Extracting Visual Common Sense Knowledge
  4. VisualCOMET: Reasoning about the Dynamic Context of a Still Image
  5. ConceptBert: Concept-Aware Representation for Visual Question Answering
  6. Like hiking? You probably enjoy nature: Persona-grounded Dialog with Commonsense Expansions
  7. Improving Question Answering by Commonsense-Based Pre-Training https://arxiv.org/pdf/1809.03568.pdf
  8. VL-BERT: https://arxiv.org/pdf/1908.08530.pdf
  9. From Recognition to Cognition: Visual Commonsense Reasoning https://arxiv.org/abs/1811.10830
  10. The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning https://arxiv.org/abs/2202.04800


31 of 41

Related work (Topic Models)


32 of 41

Motivating Example:


33 of 41


34 of 41

Backup: Motivating Example:


  • VisualCOMET: What happens before cutting?

  • What are the properties of the knife?

35 of 41

Backup: Evaluation methods

Topic modeling Evaluation

  • Topic coherence: Topic descriptors (e.g., keywords) must share some level of semantic relatedness
    • Normalized Pointwise Mutual Information (NPMI) (Lau et al., 2014)
    • External word embeddings topic coherence (WECO) (Ding et al., 2018)

  • Topic segregation: Topics should have little lexical/semantic overlap between them
    • Topic diversity (TD) (Zhang et al., 2022)
    • Inversed Rank-Biased Overlap (IRBO) (Terragni et al., 2021)

VisualCOMET+ Evaluation

  • Inference quality: measuring n-gram overlap between generated inferences and the ground-truth triplets derived from rationales
    • BLEU-2 (Papineni et al., 2002)
    • METEOR (Denkowski & Lavie, 2014)


We will compare against baselines that use neither images nor generated inferences.
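Two of these metrics are simple enough to sketch directly; the reference implementations cited above may differ in their estimation details.

```python
# Sketch: topic diversity over the top-k words, and NPMI coherence estimated
# from document-level co-occurrence counts.
import numpy as np
from itertools import combinations

def topic_diversity(topics):
    """Fraction of unique words across all topics' top-k word lists."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)

def npmi(w1, w2, doc_sets):
    """NPMI of a word pair; doc_sets holds one set of words per document."""
    n = len(doc_sets)
    p1 = sum(w1 in d for d in doc_sets) / n
    p2 = sum(w2 in d for d in doc_sets) / n
    p12 = sum(w1 in d and w2 in d for d in doc_sets) / n
    if p12 == 0:
        return -1.0                      # the words never co-occur
    if p12 == 1:
        return 1.0                       # degenerate case: co-occur everywhere
    return np.log(p12 / (p1 * p2)) / -np.log(p12)

def topic_npmi(topic, doc_sets):
    return np.mean([npmi(a, b, doc_sets) for a, b in combinations(topic, 2)])
```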

36 of 41

Backup: Why knowledge generation?

Kai knew things were getting out of control and managed to keep his temper in check

[Diagram: for the sentence above, linking to a static knowledge graph ("X keeps X's temper", "X keeps X's ... in check") retrieves context-free knowledge such as "X keeps under control", "X sweats", "X avoids a fight", "X wants to show strength", with bad links and phrases that get no linking at all; generating a dynamic graph instead yields contextual knowledge such as "Kai intends to be calm", "Kai is viewed as cautious", "Kai stays calm", "Kai wants to avoid trouble".]

Note: This slide is recreated from Antoine’s tutorial: https://maartensap.com/acl2020-commonsense/

37 of 41

Backup: VisualCOMET

[Diagram (Park et al., 2020): the VisualCOMET language model takes ROI features, the head entity and visual context (e.g., "<Person2> is ... holding onto a ..."), and a relation token (<after>), and generates the tail entity ("gasp for air <END>").]

38 of 41

Backup: COMET for Commonsense

COMET has a ConceptNet-based variant and an ATOMIC-based variant; both model P(target words | seed words, relation).

39 of 41

Backup: Variational Autoencoder as Topic Model

  • Input: documents represented as bag-of-words (BOW) vectors
  • The encoder samples the document-topic representation (the hidden representation) from the learned parameters of the distribution
  • The top words of a topic are obtained from the weight matrix that reconstructs the BOW
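A minimal PyTorch sketch of this idea (a ProdLDA-style VAE with a Gaussian prior); the actual CTM implementation differs in its prior and training details.

```python
# Sketch: a VAE topic model. The encoder maps the BOW to a latent topic vector,
# and the decoder weight matrix (beta) both reconstructs the BOW and defines
# each topic's top words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAETopicModel(nn.Module):
    def __init__(self, vocab_size, n_topics=25, hidden=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)
        self.logvar = nn.Linear(hidden, n_topics)
        self.beta = nn.Linear(n_topics, vocab_size, bias=False)  # topic-word weights

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        theta = F.softmax(z, dim=-1)               # document-topic distribution
        logits = self.beta(theta)                  # reconstruct the BOW
        recon = -(bow * F.log_softmax(logits, dim=-1)).sum(-1).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + kl

def top_words(model, vocab, k=10):
    # beta.weight has shape (vocab_size, n_topics); transpose to read per topic.
    weights = model.beta.weight.detach().t()
    return [[vocab[i] for i in topic.topk(k).indices.tolist()] for topic in weights]
```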

40 of 41

Backup: Project Timeline


Nov 1 - Nov 8

Setup codebase for VisualCOMET+ based on VisualCOMET (Sahithya & Adi)
Setup codebase for CTM (Felipe)

Nov 8 - Nov 15

Focus on data-collection and cleanup for image-based commonsense (Sahithya & Adi)
Modify VisualCOMET+ to handle visual and text inputs as required (Adi & Sahithya)
Modify CTM code to handle visual, commonsense and text inputs (Felipe)

Nov 15 - Nov 22

Finetune VisualCOMET+ and evaluate (Sahithya, Adi, Felipe)

Run the baseline CTM model without commonsense (Felipe)

Nov 22 - Dec 6

(2 weeks)

Build out a knowledge selection model and iterate (Sahithya)
Incorporate commonsense inferences into the CTM model (Adi, Felipe)

Dec 6 - Dec 12

Finalize our models and focus on collecting results and writing the report (Sahithya, Adi, Felipe)

Dec 12 - Dec 16

Presentations

41 of 41

Backup: Full Topic modeling Results
