1 of 41

Visual Commonsense Generation & its incorporation into a Multimodal Topic Modeling algorithm

Aditya Chinchure, Sahithya Ravi, Felipe González-Pizarro

CPSC 532S - Final Presentation

2 of 41

Motivating Example:

Are these images from the same topic?

Commonsense reasoning may be useful in many vision & language tasks, e.g. topic models.


3 of 41

Our contributions:

  1. VisualCOMET+: A model that generates commonsense inferences from an image plus textual cues.
    • Motivation: Useful for many multimodal downstream applications such as topic modeling, summarization, story understanding, and dialog.
    • We are the first to investigate commonsense generation over object and event relations simultaneously.
  2. Multimodal CTM: We propose a multimodal topic modeling algorithm that takes texts, images, and VisualCOMET+ inferences as input.
    • Hypothesis: Considering images and VisualCOMET+ inferences yields a better representation of documents.
    • This is the first multimodal neural topic modeling algorithm.


4 of 41

Architectural Overview:

[Architecture diagram: an image with the textual cue "Whale watching tour…" is passed to VisualCOMET+ (commonsense generation), which produces inferences such as "… used for sailing" and "… capable of fun"; the image, text, and inferences feed the Multimodal CTM, which outputs topics such as Topic 1: boat, water, dock, floating; Topic 2: ocean, wet, surf, waves; Topic 3: people, group, gathered, around.]

5 of 41

Part-I: VisualCOMET+ for commonsense generation over images and text.


6 of 41

Related Work: VisualCOMET


VisualCOMET: a VL transformer trained on 60K images

VisualCOMET: https://visualcomet.xyz (Park et al. ECCV 2020)

Relations:

  1. Before
  2. Because
  3. After

7 of 41

VisualCOMET+ as an extension to VisualCOMET


Relations:

  • Before
  • Because
  • After
  • HasProperty
  • Indicates
  • HasContext
  • AtLocation

[Example diagram (image of a chef holding a knife): knife <HasProperty> made of metal, used for cutting meat; knife <HasContext> cutting, killing; knife <AtLocation> kitchen, cutting board; Chef holding knife <Indicates> he is cooking, he is about to cut it.]

We support more relations!

8 of 41

VisualCOMET+ Model

We fine-tune a pre-trained VL transformer decoder for sentence completion conditioned on a relation.

[Model diagram: the Visual-Language Transformer Decoder receives the image feature, ROI features with their <ROI ID>s, the head entity (e.g., "Boat"), the visual context, and a relation token (e.g., <hascontext>), and autoregressively generates the tail entity (e.g., "sailing in ocean <END>").]
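As a rough illustration of how the textual side of this input sequence can be assembled, here is a minimal sketch; the tokenizer choice, special-token names, and field order are our own simplification, not the released VisualCOMET code (image and ROI features are injected as dense embeddings inside the model and are omitted here).

```python
# Sketch of assembling the textual input for the VisualCOMET+ decoder.
# Token names, tokenizer choice, and field order are illustrative assumptions.
from transformers import GPT2Tokenizer

RELATION_TOKENS = ["<before>", "<because>", "<after>",
                   "<hasproperty>", "<indicates>", "<hascontext>", "<atlocation>"]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": RELATION_TOKENS + ["<END>"]})

def build_text_input(visual_context, head_entity, relation, tail_entity=None):
    """Visual context + head entity + relation, optionally followed by the
    teacher-forced tail entity during training."""
    assert relation in RELATION_TOKENS
    text = f"{visual_context} {head_entity} {relation}"
    if tail_entity is not None:
        text = f"{text} {tail_entity} <END>"
    return tokenizer(text, return_tensors="pt")

# e.g. build_text_input("Whale watching tour", "Boat", "<hascontext>", "sailing in ocean")
```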


10 of 41

Datasets: Sherlock

Clue (observed from the image, marked with a bounding box) → Rationale (what does the clue indicate?):

  • warm coats being worn → It is cold outside.
  • Horses wearing saddles and reins. → Horses are used for riding.
  • snow on the ground and in the trees → It snowed recently.

363K clue-rationale pairs

11 of 41

Clue Rationales ⇾ ConceptNet Paths

[Diagram: Clue: "Horses wearing saddles and reins." Rationale: "Horses are used for horseback riding." The clue <indicates> the rationale (a non-ConceptNet relation). Starting from the clue and rationale, we match their mentions to ConceptNet nodes (e.g., saddles, riding, race track, accessory for horseback riding) and run shortest-path finding over ConceptNet relations such as <usedfor>, <atlocation>, <hascontext>, <hasproperty>.]
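A minimal sketch of the node-matching and shortest-path step, assuming ConceptNet edges are already available as (head, relation, tail) string triples; the substring-based node matching below is a simplification of what we actually do.

```python
# Sketch: link a Sherlock clue to its rationale through ConceptNet.
# `edges` is assumed to be an iterable of (head, relation, tail) strings
# extracted from a ConceptNet dump.
import networkx as nx

def build_graph(edges):
    g = nx.Graph()                      # undirected, to make path finding permissive
    for head, rel, tail in edges:
        g.add_edge(head, tail, relation=rel)
    return g

def match_nodes(sentence, graph):
    """Very rough node matching: graph nodes whose label appears in the sentence."""
    sentence = sentence.lower()
    return [n for n in graph.nodes if n.lower() in sentence]

def shortest_paths(clue, rationale, graph):
    paths = []
    for src in match_nodes(clue, graph):
        for dst in match_nodes(rationale, graph):
            try:
                paths.append(nx.shortest_path(graph, src, dst))
            except nx.NetworkXNoPath:
                continue
    return sorted(paths, key=len)       # prefer the shortest connections

# shortest_paths("Horses wearing saddles and reins.",
#                "Horses are used for horseback riding", graph)
```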


13 of 41

Clue: Horses wearing saddles and reins.

Rationale: Horses are used for riding.

14 of 41

ConceptNet Paths ⇾ Triplets

Each ConceptNet path is flattened into (head entity, relation, tail entity) triples, each paired with the visual context (the image region of the clue):

Head Entity | <relation> | Tail Entity
Horses wearing saddles and reins. | <indicates> | Horses are used for riding.
Saddles | <usedfor> | riding
Saddle | <hasproperty> | accessory for riding
… | … | …

50K triples, with 80% for training, 10% for test, and 10% for validation.
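A sketch of how a found path can be flattened into training triples and split 80/10/10; the field names and the graph object follow the previous sketch and are illustrative only.

```python
# Sketch: turn one ConceptNet path into (head, relation, tail) triples and
# split the collected triples into train/test/validation.
import random

def path_to_triples(path_nodes, graph, visual_context):
    """Each consecutive node pair on the path becomes one training triple."""
    triples = []
    for head, tail in zip(path_nodes, path_nodes[1:]):
        rel = graph.edges[head, tail]["relation"]
        triples.append({"visual_context": visual_context,   # image region of the clue
                        "head": head, "relation": rel, "tail": tail})
    return triples

def train_test_val_split(triples, seed=0):
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    n = len(triples)
    return (triples[: int(0.8 * n)],                 # 80% train
            triples[int(0.8 * n): int(0.9 * n)],     # 10% test
            triples[int(0.9 * n):])                  # 10% validation
```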

15 of 41

VisualCOMET+ Results

Model* | Test Set | Relations | BLEU-2 | METEOR
VisualCOMET+ | Sherlock | indicates, hasproperty, atlocation, hascontext | 0.16 | 0.18

METEOR matches VisualCOMET on their test set (0.18).

[Example (image region: "Table full of drinks"): generated inferences for <indicates>, <hascontext>, <hasproperty>, and <atlocation> include "People are drinking alcohol", "A celebration taking place.", "This is in a bar.", and "bar".]

* Trained with the same hyperparameters as VisualCOMET: https://visualcomet.xyz (Park et al., ECCV 2020). See more ablations in the report.


17 of 41

[Example 1 (image region: "a watch on the wrist of a person"): generations across <indicates>, <hascontext>, <hasproperty>, <atlocation> include "The person wearing this watch cares about being on time.", "computing", "is used for keeping track of time", "time", and "wristwatch".]

[Example 2 (image region: "a person surrounded by people pointing guns at them"): generations across <indicates>, <intent>, <hasproperty>, <after> include "the person is a criminal and the people are threatening them", "they are a criminal", "the person is a criminal and they are doing something wrong", and "the person is being held as hostage".]

18 of 41

[Example (image region: "paper bag and plastic bags on table"): generations across <atlocation>, <intent>, <hasproperty>, <indicates>, <before>, <after>, <hascontext> include "kitchen", "someone left things in this bag after eating", "is used for dishes", "The person who owns this apartment is a messy person.", and "someone left the food and the items here".]

19 of 41

Part-II: Incorporating VisualCOMET+ in a downstream task (i.e., topic modeling)

We explored whether incorporating image features and VisualCOMET+ inferences helps us obtain a better representation of documents and identify more coherent and diverse topics.

20 of 41

Contextualized Topic Model (CTM)

  • CTM is based on a variational autoencoder.

  • Input: the SBERT representation of each document.

  • The encoder samples the document-topic representation (the hidden representation) from the learned parameters of the distribution.

  • The top words of a topic are obtained from the weight matrix that reconstructs the BOW.

Contextualized embeddings (e.g., SBERT)
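For reference, a minimal text-only CTM run with the contextualized-topic-models package looks roughly like this; argument names may differ across package versions, and the SBERT checkpoint and toy documents are only examples.

```python
# Sketch of training a text-only CTM (ZeroShotTM variant) with SBERT embeddings.
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

raw_docs = ["a lovely bride and her groom at their wedding reception",
            "people gathered around a table full of drinks at the bar"]
bow_docs = raw_docs   # in practice: lowercased, stopword-filtered versions

prep = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")  # SBERT model
dataset = prep.fit(text_for_contextual=raw_docs, text_for_bow=bow_docs)

ctm = ZeroShotTM(bow_size=len(prep.vocab), contextual_size=768, n_components=25)
ctm.fit(dataset)

print(ctm.get_topic_lists(10))   # top-10 words per topic
```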

21 of 41

Contribution: Multimodal Contextualized Topic Model


[1] Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, and Elisabetta Fersini. 2021. Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1676–1683, Online. Association for Computational Linguistics

[Diagram: the baseline CTM represents each document by its text alone, e.g. "A lovely bride and her groom at their wedding reception".]

22 of 41

Contribution: Multimodal Contextualized Topic Model


[Diagram: each document is now represented by its text ("A lovely bride and her groom at their wedding reception"), its image, and VisualCOMET+ inferences (e.g., Hascontext: "The couple married in India"); all three are encoded with contextualized embeddings (e.g., CLIP) and fed to the CTM.]
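A sketch of how the multimodal document representation can be built with CLIP before it is handed to the CTM; the checkpoint name and the simple concatenation scheme are assumptions for illustration.

```python
# Sketch: one multimodal embedding per document by concatenating CLIP encodings
# of the text, the image, and the joined VisualCOMET+ inferences.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")      # encodes both images and text

def embed_document(text, image_path, inferences):
    text_emb = clip.encode(text)
    image_emb = clip.encode(Image.open(image_path))
    inference_emb = clip.encode(" ".join(inferences))
    return np.concatenate([text_emb, image_emb, inference_emb])

# The concatenated vectors replace SBERT as the CTM's contextualized input,
# so contextual_size becomes 3 * 512 for ViT-B/32.
```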

23 of 41

Dataset: Visual Storytelling (VIST)


[1] Huang, T. H., Ferraro, F., Mostafazadeh, N., Misra, I., Agrawal, A., Devlin, J., ... & Mitchell, M. (2016, June). Visual storytelling. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 1233-1239).

  • 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language.
  • We use DII (descriptions-in-isolation) for generating commonsense inferences and SIS (stories-in-sequence) for topic modeling.
  • This dataset is a proxy for social media data, where images are paired with text, like SIS, that adds information beyond the image itself.
  • Source: http://www.visionandlanguage.net/VIST/

24 of 41

Generating Inferences for VIST

  1. Sample 17,000 instances of image-caption-story pairs from the dataset.
  2. Use the image's caption for generating commonsense:
    1. Feed the image feature of the full image (one bounding box containing the whole image) and the image's caption as the textual cue.
    2. Generate for the four new relations we added ("hascontext", "atlocation", "hasproperty", "indicates") → obtain one inference from each and join the inference sentences (see the sketch below).

[Example: Caption (DII attribute): "Many colorful, cut flowers adorn the front of the shop." Generated inferences across <hascontext>, <hasproperty>, <indicates> include "This is a flower shop, located in the heart of the city.", "This shop has a wide variety of flowers available for sale", and "The owner of the shop has a really great sense of style".]
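The generation loop itself is straightforward; in the sketch below, generate_tail is a hypothetical wrapper around a decode call to the trained VisualCOMET+ model.

```python
# Sketch of the VIST inference-generation step: one region covering the whole
# image plus the DII caption go in, one inference per relation comes out.
RELATIONS = ["<hascontext>", "<atlocation>", "<hasproperty>", "<indicates>"]

def full_image_box(width, height):
    # A single bounding box spanning the entire image.
    return [0, 0, width, height]

def inferences_for_image(image_feature, caption, generate_tail):
    sentences = []
    for relation in RELATIONS:
        tail = generate_tail(image_feature, caption, relation)  # greedy/beam decode
        sentences.append(tail)
    return " ".join(sentences)   # joined and later embedded for the topic model
```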

25 of 41

Topic Modeling: Images and Inferences improve topic model quality

Comparison of topic coherence and topic diversity across document representations. Each result is averaged over 11 runs; all metrics are computed for 25 topics.

We also evaluated other hyperparameters, such as different numbers of topics, and other contextualized embeddings (i.e., SBERT); these results will be included in the report.

26 of 41

Insights: Multimodal CTM Visualization


[1] Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).

We used an interactive topic-modeling visualization tool (LDAvis [1]) to interpret topics and analyze the quality of our intermediate results.
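pyLDAvis is one implementation of LDAvis; feeding it the trained model's distributions looks roughly like this, assuming the topic-term and document-topic matrices have already been extracted from the CTM (random toy stand-ins are used below so the snippet runs on its own).

```python
# Sketch: build the interactive LDAvis view from a topic model's outputs.
import numpy as np
import pyLDAvis

rng = np.random.default_rng(0)
topic_term_dists = rng.dirichlet(np.ones(200), size=25)   # toy stand-in: 25 topics x 200 words
doc_topic_dists = rng.dirichlet(np.ones(25), size=100)    # toy stand-in: 100 docs x 25 topics
doc_lengths = rng.integers(20, 100, size=100)             # tokens per document
vocab = [f"word{i}" for i in range(200)]
term_frequency = rng.integers(1, 50, size=200)             # corpus-wide counts

vis = pyLDAvis.prepare(topic_term_dists, doc_topic_dists,
                       doc_lengths, vocab, term_frequency)
pyLDAvis.save_html(vis, "topics.html")   # open in a browser to explore topics
```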

27 of 41

Insights: Multimodal CTM


Topic #1: ['couple', 'wedding', 'bride', 'reception', 'wife', 'husband', 'love', 'two', 'guests', 'groom']

Topic #2: ['speech', 'gave', 'working', 'talked', 'speaker', 'audience', 'class', 'hard', 'work', 'american']

Topic #3: ['little', 'kids', 'cake', 'birthday', 'girl', 'brought', 'grandma', 'candles', 'boy', 'excited']

28 of 41

Discussion

  1. VisualCOMET+:
    1. Automatic metrics match VisualCOMET.
    2. Manual analysis reveals sensible inferences for most examples.
  2. Multimodal CTM:
    • Images & inferences can result in more coherent and diverse topics.
    • Analysis reveals useful topics!
    • This is the first multimodal neural topic model.
  3. We successfully integrated VisualCOMET+ with Multimodal CTM.

Limitations:

  • VisualCOMET+:
    • Inference diversity is limited ⇾ more diversity in the training data is needed.
    • Inferences are sometimes not faithful to the relation → more data/processing for such relations is needed.
  • Multimodal CTM:
    • The VAE decoder does not reconstruct image features; adding this might boost performance.


Goals for next week: i) re-run our pipeline end-to-end with more and improved data from Sherlock to report our final results; ii) analyze the sources of error.

29 of 41

Thank You!


30 of 41

Related work (VisualCOMET related)

  1. KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation
  2. Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
  3. Stating the Obvious: Extracting Visual Common Sense Knowledge
  4. VisualCOMET: Reasoning about the Dynamic Context of a Still Image
  5. ConceptBert: Concept-Aware Representation for Visual Question Answering
  6. Like hiking? You probably enjoy nature: Persona-grounded Dialog with Commonsense Expansions
  7. Improving Question Answering by Commonsense-Based Pre-Training https://arxiv.org/pdf/1809.03568.pdf
  8. VL-BERT: https://arxiv.org/pdf/1908.08530.pdf
  9. From Recognition to Cognition: Visual Commonsense Reasoning https://arxiv.org/abs/1811.10830
  10. The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning https://arxiv.org/abs/2202.04800


31 of 41

Related work (Topic Models)


32 of 41

Motivating Example:


33 of 41


34 of 41

Backup: Motivating Example:


  • VisualCOMET: What happens before cutting?

  • What are the properties of the knife?

35 of 41

Backup: Evaluation methods

Topic modeling Evaluation

  • Topic coherence: Topic descriptors (e.g., keywords) must share some level of semantic relatedness
    • Normalized Pointwise Mutual Information (NPMI) (Lau et al., 2014)
    • External word embeddings topic coherence (WECO) (Ding et al., 2018)

  • Topic segregation: Topics should have little lexical/semantic overlap between them
    • Topic diversity (TD) (Zhang et al., 2022)
    • Inversed Rank-Biased Overlap (IRBO) (Terragni et al., 2021)

VisualCOMET+ Evaluation

  • Inference quality: measuring n-gram overlap between generated inferences and the ground-truth triplets derived from rationales
    • BLEU-2 (Papineni et al., 2002)
    • METEOR (Denkowski & Lavie, 2014)


We will compare against baselines that use neither images nor generated inferences.
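Two of these metrics are simple enough to sketch directly; the reference implementations cited above may differ in their estimation details.

```python
# Sketch: topic diversity over the top-k words, and NPMI coherence estimated
# from document-level co-occurrence counts.
import numpy as np
from itertools import combinations

def topic_diversity(topics):
    """Fraction of unique words across all topics' top-k word lists."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)

def npmi(w1, w2, doc_sets):
    """NPMI of a word pair; doc_sets holds one set of words per document."""
    n = len(doc_sets)
    p1 = sum(w1 in d for d in doc_sets) / n
    p2 = sum(w2 in d for d in doc_sets) / n
    p12 = sum(w1 in d and w2 in d for d in doc_sets) / n
    if p12 == 0:
        return -1.0                      # the words never co-occur
    if p12 == 1:
        return 1.0                       # degenerate case: co-occur everywhere
    return np.log(p12 / (p1 * p2)) / -np.log(p12)

def topic_npmi(topic, doc_sets):
    return np.mean([npmi(a, b, doc_sets) for a, b in combinations(topic, 2)])
```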

36 of 41

Backup: Why knowledge generation?

Kai knew things were getting out of control and managed to keep his temper in check

[Diagram: for the sentence above, linking to a static knowledge graph ("X keeps X's temper", "X keeps X's ... in check") retrieves context-free knowledge such as "X keeps under control", "X sweats", "X avoids a fight", "X wants to show strength", with bad links and phrases that get no linking at all; generating a dynamic graph instead yields contextual knowledge such as "Kai intends to be calm", "Kai is viewed as cautious", "Kai stays calm", "Kai wants to avoid trouble".]

Note: This slide is recreated from Antoine’s tutorial: https://maartensap.com/acl2020-commonsense/

37 of 41

Backup: VisualCOMET

[Diagram (Park et al., 2020): the VisualCOMET language model takes ROI features, the head entity and visual context (e.g., "<Person2> is ... holding onto a ..."), and a relation token (<after>), and generates the tail entity ("gasp for air <END>").]

38 of 41

Backup: COMET for Commonsense

COMET has a ConceptNet-based variant and an ATOMIC-based variant; both model P(target words | seed words, relation).

39 of 41

Backup: Variational Autoencoder as Topic Model

  • Input: documents represented as bag-of-words (BOW) vectors
  • The encoder samples the document-topic representation (the hidden representation) from the learned parameters of the distribution
  • The top words of a topic are obtained from the weight matrix that reconstructs the BOW
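A minimal PyTorch sketch of this idea (a ProdLDA-style VAE with a Gaussian prior); the actual CTM implementation differs in its prior and training details.

```python
# Sketch: a VAE topic model. The encoder maps the BOW to a latent topic vector,
# and the decoder weight matrix (beta) both reconstructs the BOW and defines
# each topic's top words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAETopicModel(nn.Module):
    def __init__(self, vocab_size, n_topics=25, hidden=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)
        self.logvar = nn.Linear(hidden, n_topics)
        self.beta = nn.Linear(n_topics, vocab_size, bias=False)  # topic-word weights

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        theta = F.softmax(z, dim=-1)               # document-topic distribution
        logits = self.beta(theta)                  # reconstruct the BOW
        recon = -(bow * F.log_softmax(logits, dim=-1)).sum(-1).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + kl

def top_words(model, vocab, k=10):
    # beta.weight has shape (vocab_size, n_topics); transpose to read per topic.
    weights = model.beta.weight.detach().t()
    return [[vocab[i] for i in topic.topk(k).indices.tolist()] for topic in weights]
```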

40 of 41

Backup: Project Timeline


Nov 1 - Nov 8

Setup codebase for VisualCOMET+ based on VisualCOMET (Sahithya & Adi)
Setup codebase for CTM (Felipe)

Nov 8 - Nov 15

Focus on data-collection and cleanup for image-based commonsense (Sahithya & Adi)
Modify VisualCOMET+ to handle visual and text inputs as required (Adi & Sahithya)
Modify CTM code to handle visual, commonsense and text inputs (Felipe)

Nov 15 - Nov 22

Finetune VisualCOMET+ and evaluate (Sahithya, Adi, Felipe)

Run the baseline CTM model without commonsense (Felipe)

Nov 22 - Dec 6

(2 weeks)

Build out a knowledge selection model and iterate (Sahithya)
Incorporate commonsense inferences into the CTM model (Adi, Felipe)

Dec 6 - Dec 12

Finalize our models and focus on collecting results and writing the report (Sahithya, Adi, Felipe)

Dec 12 - Dec 16

Presentations

41 of 41

Backup: Full Topic modeling Results
