Visual Commonsense Generation & its incorporation into a Multimodal Topic Modeling algorithm
Aditya Chinchure, Sahithya Ravi, Felipe González-Pizarro
CPSC 532S - Final Presentation
Motivating Example:
Are these images from the same topic?
Commonsense reasoning may be useful in many vision & language tasks, e.g. topic models.
Our contributions:
Architectural Overview:
Multimodal CTM
VisualCOMET+ (Commonsense generation)
Whale watching tour…
<textual cue>
… used for sailing
… capable of fun
Topic 1: boat, water, dock, floating
Topic 2: ocean, wet, surf, waves
Topic 3: people, group, gathered, around
Part-I: VisualCOMET+ for commonsense generation over images and text.
Related Work: VisualCOMET
VisualCOMET: a VL transformer trained on 60K images
VisualCOMET: https://visualcomet.xyz (Park et al., ECCV 2020)
Relations:
VisualCOMET+ as an extension to VisualCOMET
Relations:
he is about to cut it
made of metal
kitchen
cutting
cutting board
he is cooking
used for cutting meat
killing
knife <HasProperty>
knife <HasContext>
knife <AtLocation>
Chef holding knife <Indicates>
We support more relations!
VisualCOMET+ Model
We fine-tune a pre-trained VL transformer decoder for sentence completion: given the visual context, a head entity, and a relation token, it generates the tail entity.
Figure: VisualCOMET+ model. Inputs to the Visual-Language Transformer Decoder: visual context (image feature, ROI features, …), head entity (<ROI ID>, ROI feature, "Boat …"), and <relation> (e.g., <hascontext>). Output: tail entity ("sailing in ocean <END>").
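As a rough illustration of the sentence-completion setup, the text side of one training example can be laid out as head entity, relation token, then tail entity ending in <END>. The sketch below is not the project's code: `build_sequence` and the target mask are hypothetical, visual features are omitted, and predicting only the tail tokens is an assumption.

```python
from typing import List, Tuple

END = "<END>"

def build_sequence(head: str, relation: str, tail: str) -> Tuple[List[str], List[int]]:
    """Lay out head tokens, a relation token, and tail tokens as one sequence.

    Returns (tokens, target_mask). The mask is 1 on positions the decoder is
    asked to generate (tail tokens and <END>) and 0 on the conditioning prefix.
    Visual features (image and ROI) would be prepended in a real model; they
    are left out of this text-only sketch.
    """
    head_tokens = head.lower().split()
    tail_tokens = tail.lower().split() + [END]
    tokens = head_tokens + [f"<{relation}>"] + tail_tokens
    target_mask = [0] * (len(head_tokens) + 1) + [1] * len(tail_tokens)
    return tokens, target_mask

tokens, target_mask = build_sequence("Boat", "hascontext", "sailing in ocean")
print(tokens)
print(target_mask)
```

At inference time only the prefix (head entity and relation) would be supplied, and the decoder would generate tokens until it emits <END>.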
Datasets: Sherlock
Clue: Observed from the image marked with a bounding box
warm coats being worn
Horses wearing saddles and reins.
snow on the ground and in the trees
Rationale: What does the clue indicate?
Horses are used for riding.
It is cold outside.
It snowed recently.
363K clue-rationale pairs
Clue Rationales ⇾ ConceptNet Paths
Clue: Horses wearing saddles and reins.
Rationale: Horses are used for horseback riding.
Clue <indicates> rationale (non-ConceptNet relation)
Steps: start with clue rationales → ConceptNet node matching → shortest path finding
Example nodes: saddles, riding, race track, accessory for horseback riding
Path relations: <usedfor>, <atlocation>, <hascontext>, <hasproperty>
Clue: Horses wearing saddles and reins.
Rationale: Horses are used for riding.
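The node-matching and shortest-path steps can be sketched on a toy graph. This is only an illustration: the edges below are hand-written stand-ins rather than real ConceptNet data, the word-level matching is a simplification, and `build_graph`, `match_nodes`, and `shortest_path` are hypothetical names.

```python
from collections import deque

# Toy, hand-written edges (hypothetical; not taken from ConceptNet).
EDGES = [
    ("saddles", "usedfor", "riding"),
    ("saddles", "hasproperty", "accessory for horseback riding"),
    ("riding", "atlocation", "race track"),
    ("horses", "hascontext", "riding"),
]

def build_graph(edges):
    """Directed adjacency list: head -> [(relation, tail), ...]."""
    graph = {}
    for head, rel, tail in edges:
        graph.setdefault(head, []).append((rel, tail))
    return graph

def match_nodes(text, graph):
    """Return the words of `text` that are also node names in the graph."""
    nodes = set(graph) | {tail for outs in graph.values() for _, tail in outs}
    words = [w.strip(".,").lower() for w in text.split()]
    return [w for w in words if w in nodes]

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of (head, relation, tail) or None."""
    if start == goal:
        return []
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for rel, nxt in graph.get(node, []):
            if nxt in parent:
                continue
            parent[nxt] = (node, rel)
            if nxt == goal:
                # Walk back from the goal to the start to recover the path.
                path, cur = [], goal
                while parent[cur] is not None:
                    prev, r = parent[cur]
                    path.append((prev, r, cur))
                    cur = prev
                return path[::-1]
            queue.append(nxt)
    return None

graph = build_graph(EDGES)
clue_nodes = match_nodes("Horses wearing saddles and reins.", graph)
path = shortest_path(graph, "saddles", "race track")
print(clue_nodes)
print(path)
```

Each edge on a returned path is already a (head, relation, tail) triple, which is the form the next step consumes.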
ConceptNet Paths ⇾ Triplets
Visual Context | Head Entity | <relation> | Tail Entity |
(image) | Horses wearing saddles and reins. | <indicates> | Horses are used for riding. |
(image) | Saddles | <usedfor> | riding |
(image) | Saddle | <hasproperty> | accessory for riding |
.... | .... | .... | .... |
.... | .... | .... | .... |
50K triples with 80% for training, 10% for test & 10% for validation.
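A minimal sketch of an 80/10/10 split follows. It assumes a random, triple-level shuffle with a fixed seed; the slides do not say whether the actual split was random or grouped by image. `split_triples` is a hypothetical helper, and the integers stand in for real triples.

```python
import random

def split_triples(triples, seed=0, train_pct=80, test_pct=10):
    """Shuffle and split into (train, test, validation) by percentage.

    Integer arithmetic avoids floating-point rounding in the split sizes.
    Whatever is left after train and test becomes the validation set.
    """
    items = list(triples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_test = len(items) * test_pct // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]
    return train, test, val

# Placeholder data: 50,000 integers standing in for the 50K triples.
train, test, val = split_triples(range(50_000))
print(len(train), len(test), len(val))
```

If several triples come from the same image, grouping by image before splitting would keep test images unseen during training; that choice is outside what the slides specify.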
VisualCOMET+ Results
Model* | Test Set | Relations | BLEU-2 | METEOR |
VisualCOMET+ | Sherlock | indicates, hasproperty, atlocation, hascontext | 0.16 | 0.18 |
Matches VisualCOMET on their test set (0.18)
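For readers unfamiliar with BLEU-2, here is a simplified sentence-level sketch: the geometric mean of clipped unigram and bigram precision, multiplied by a brevity penalty. It uses a single reference and no smoothing, so it will not reproduce the numbers in the table, which were presumably computed with a standard evaluation toolkit. `ngrams` and `bleu2` are illustrative names.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu2(candidate, reference):
    """Simplified BLEU-2 for one candidate against one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if len(cand) < 2:
        return 0.0
    log_precisions = []
    for n in (1, 2):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / sum(cand_counts.values())))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / 2)

print(bleu2("This is a bar", "This is in a bar"))
```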
A celebration taking place.
This is in a bar.
People are drinking alcohol
bar
Table full of drinks
<hascontext>
<hasproperty>
<indicates>
<atlocation>
* Trained with same hyperparams as VisualCOMET: https://visualcomet.xyz (Park et al. ECCV 2020). See more ablations in report.
The person wearing this watch cares about being on time.
a watch on the wrist of a person
<indicates>
computing
is used for keeping track of time
time
wristwatch
<hascontext>
<hasproperty>
<atlocation>
the person is a criminal and the people are threatening them
a person surrounded by people pointing guns at them
<indicates>
they are a criminal
the person is a criminal and they are doing something wrong
the person is being held as hostage
<intent>
<hasproperty>
<after>
kitchen
paper bag and plastic bags on table
<atlocation>
someone left things in this bag after eating
is used for dishes
the person who owns this apartment is a messy person
<intent>
<hasproperty>
<indicates>
<before>
<after>
<hascontext>
The person who owns this apartment is a messy person.
someone left the food and the items here
kitchen
Part-II: Incorporating VisualCOMET+ in a downstream task (i.e., topic modeling)
We explored whether incorporating image features and VisualCOMET+ inferences yields better document representations and more coherent and diverse topics.
Contextualized Topic Model (CTM)
Contextualized Embeddings (e.g., SBERT)
Contribution: Multimodal Contextualized Topic Model
[1] Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, and Elisabetta Fersini. 2021. Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1676–1683, Online. Association for Computational Linguistics
Text
A lovely bride and her groom at their wedding reception
Hascontext: The couple married in India
A lovely bride and her groom at their wedding reception
Image
Text
VisualCOMET+
Inferences
Text
A lovely bride and her groom at their wedding reception
Contextualized Embeddings (e.g., CLIP)
Dataset: Visual Storytelling (VIST)
[1] Huang, T. H., Ferraro, F., Mostafazadeh, N., Misra, I., Agrawal, A., Devlin, J., ... & Mitchell, M. (2016, June). Visual storytelling. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 1233-1239).
Generating Inferences for VIST
Many colorful, cut flowers adorn the front of the shop.
<hascontext>
The owner of the shop has a really great sense of style
<hasproperty>
<indicates>
This is a flower shop, located in the heart of the city.
This shop has a wide variety of flowers available for sale
Caption (DII attribute)
…
Topic Modeling: Images and Inferences improve topic model quality
Comparison of topic coherence and topic diversity across document representations. Each result is averaged over 11 runs, and all metrics are computed for 25 topics.
We also evaluated other settings, such as different numbers of topics and other contextualized embeddings (i.e., SBERT); these results will be included in the report.
Multimodal CTM
Insights: Multimodal CTM Visualization
[1] Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).
We used an interactive topic modeling visualization tool to interpret topics and analyze the quality of our intermediate results.
Insights: Multimodal CTM
Topic #1: ['couple', 'wedding', 'bride', 'reception', 'wife', 'husband', 'love', 'two', 'guests', 'groom']
Topic #2: ['speech', 'gave', 'working', 'talked', 'speaker', 'audience', 'class', 'hard', 'work', 'american']
Topic #3: ['little', 'kids', 'cake', 'birthday', 'girl', 'brought', 'grandma', 'candles', 'boy', 'excited']
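The slides do not state which diversity metric was used. One common definition is the fraction of unique words among the top-k words of all topics. Under that assumption, a minimal sketch applied to the three topics above (`topic_diversity` is an illustrative name):

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; lower values mean more overlap.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)

topics = [
    ['couple', 'wedding', 'bride', 'reception', 'wife', 'husband', 'love', 'two', 'guests', 'groom'],
    ['speech', 'gave', 'working', 'talked', 'speaker', 'audience', 'class', 'hard', 'work', 'american'],
    ['little', 'kids', 'cake', 'birthday', 'girl', 'brought', 'grandma', 'candles', 'boy', 'excited'],
]
print(topic_diversity(topics))
```

These three topics share no words, so the score is at its maximum; a model that repeats generic words across topics would score lower.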
Discussion
Goals for the next week: (i) re-run our pipeline end-to-end with more and improved data from Sherlock to report our final results; (ii) analyze the resulting errors.
Thank You!
Related work (VisualCOMET related)
Related work (Topic Models)
Motivating Example:
Backup: Motivating Example:
Backup: Evaluation methods
Topic modeling Evaluation
VisualCOMET+ Evaluation
We will compare against baselines that use neither images nor generated inferences.
Backup: Why knowledge generation?
Kai knew things were getting out of control and managed to keep his temper in check
Link to static Knowledge Graph (context-free knowledge, bad links):
X keeps X's temper
X keeps under control
X sweats
X avoids a fight
X wants to show strength
X keeps X's in check
Generate dynamic graph (contextual knowledge, no linking):
Kai intends to be calm
Kai is viewed as cautious
Kai stays calm
Kai wants to avoid trouble
Note: This slide is recreated from Antoine’s tutorial: https://maartensap.com/acl2020-commonsense/
Backup: VisualCOMET
Figure: VisualCOMET architecture (Park et al., 2020). A language model takes the visual context (ROI features), a head entity (e.g., "<Person2> is holding onto a …"), and a <relation> token (e.g., <after>), and generates the tail entity (e.g., "gasp for air <END>").
Backup: COMET for Commonsense
ConceptNet based COMET
Atomic based COMET
P(target words | seed words, relation)
Backup: Variational Autoencoder as Topic Model
Backup: Project Timeline
Nov 1 - Nov 8 | Set up codebase for VisualCOMET+ based on VisualCOMET (Sahithya & Adi); set up codebase for CTM (Felipe) |
Nov 8 - Nov 15 | Focus on data collection and cleanup for image-based commonsense (Sahithya & Adi); modify VisualCOMET+ to handle visual and text inputs as required (Adi & Sahithya); modify CTM code to handle visual, commonsense, and text inputs (Felipe) |
Nov 15 - Nov 22 | Fine-tune VisualCOMET+ and evaluate (Sahithya, Adi, Felipe); run the baseline CTM model without commonsense (Felipe) |
Nov 22 - Dec 6 (2 weeks) | Build out a knowledge selection model and iterate (Sahithya); incorporate commonsense inferences into the CTM model (Adi, Felipe) |
Dec 6 - Dec 12 | Finalize our models and focus on collecting results and writing the report (Sahithya, Adi, Felipe) |
Dec 12 - Dec 16 | Presentations |
Backup: Full Topic modeling Results