VisToT: Vision-Augmented Table-to-Text Generation
https://vl2g.github.io/projects/vistot/
AI for India @ AI-ML Systems’23
Prajwal Gatti, Anand Mishra, Manish Gupta, Mithun Das Gupta
Can we combine information in tables and images to generate rich text descriptions?
Table-to-Text Generation
Conventional Table-to-Text Method:
“Lough Leane is a 4700 acre estate in Killarney, County Kerry, Ireland.”
[Lebret et al. EMNLP’16; Novikova et al. SIGDIAL’17; Parikh et al. EMNLP’20]
Vision-Augmented Table-to-Text Generation
Vision-augmented Table-to-Text (Our work):
“Lough Leane is a large lake in Killarney, County Kerry, Ireland.”
Ground Truth:
“Lough Leane is the largest of the three lakes in Killarney, County Kerry, Ireland.”
Vision-Augmented Table-to-Text Generation
Given a table T and an image I about an entity,
generate a text summary S describing the entity, using T and I as the source context.
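The task definition above can be pictured as a (T, I, S) triple per entity. The following is a minimal sketch of such a record, not the dataset's actual schema: the class name `VisToTExample`, its field names, and the image path are assumptions made for illustration. The table rows and summary are taken from the Lough Leane example in these slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VisToTExample:
    """One (T, I, S) triple. Field names are illustrative assumptions."""
    table: List[Tuple[str, str]]  # T: ordered (key, value) rows
    image_path: str               # I: location of the entity's image
    summary: str                  # S: reference text summary


example = VisToTExample(
    table=[
        ("Name", "Lough Leane"),
        ("Location", "Killarney, County Kerry"),
        ("Basin countries", "Ireland"),
    ],
    image_path="images/lough_leane.jpg",  # hypothetical path, not from the dataset
    summary=(
        "Lough Leane is the largest of the three lakes in "
        "Killarney, County Kerry, Ireland."
    ),
)
```

A model for this task would read `table` and `image_path` as source context and be trained to produce `summary`.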
We introduce WikiLandmarks
A new dataset containing tables and images for 73K world landmarks
Image
Table
Text summary
Name | Amitabha Drukpa |
Country | Nepal |
Location | Kathmandu |
Dedicated To | Amitabha |
“Amitabha Monastery is a Tibetan Buddhist Monastery in Nepal”
Image
Table
Text summary
Name | Michigan Stadium |
Location | 1201 South Main Street Ann Arbor, Michigan |
Owner | University of Michigan |
Nickname | The Big House |
“Michigan Stadium, nicknamed The Big House, is the football stadium for the University of Michigan in Ann Arbor, Michigan”
Image
Table
Text summary
Name | Niesen |
Elevation | 2,632 m |
Prominence | 407 m |
Location | Canton of Bern, Switzerland |
Parent Range | Bernese Alps |
“The Niesen is a mountain peak of the Bernese Alps in the Canton of Bern, Switzerland”
WikiLandmarks is geodiverse!
Colosseum, Italy
Castelão, Brazil
Mysore Palace, India
St. Louis Cathedral, USA
Belukha Mountain, Russia
Dataset Statistics
We introduce VT3
A multimodal transformer for the VisToT problem
VT3: Visual-Tabular Data-to-Text Transformer
Table
Key | Value |
Name | Lough Leane |
Location | Killarney, County Kerry |
Coordinates | 58°2’30’’N 9°33’0’’W |
Basin countries | Ireland |
Surface Area | 4,700 acres |
Islands | Innisfallen |
[Figure: VT3 architecture. Labeled components: Table, Image, Tokenize, Swin Transformer, BART Encoder, BART Decoder. Linearized table input: “[NAME] Lough Leane [Location] Killar…”. Shown with the BART Decoder: “[bos] Lough Leane is a large lake”. Generated Sentence: “Lough Leane is a large lake in Killarney, …”]
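The linearized table string in the VT3 figure (“[NAME] Lough Leane [Location] Killar…”) suggests that each key is wrapped in brackets and followed by its value. Below is a minimal sketch of that flattening step under this assumption. The function name `linearize_table` is hypothetical, and the exact marker format VT3 uses (casing, special tokens) is not specified in these slides, so the output format here is illustrative only.

```python
from typing import Iterable, Tuple


def linearize_table(rows: Iterable[Tuple[str, str]]) -> str:
    """Flatten (key, value) rows into one string, marking each key in brackets.

    The "[key] value" format is an assumption based on the figure,
    not a confirmed detail of VT3's preprocessing.
    """
    return " ".join(f"[{key}] {value}" for key, value in rows)


rows = [
    ("Name", "Lough Leane"),
    ("Location", "Killarney, County Kerry"),
    ("Basin countries", "Ireland"),
    ("Surface Area", "4,700 acres"),
    ("Islands", "Innisfallen"),
]
linearized = linearize_table(rows)
print(linearized)
```

A string like this can then be passed to a text tokenizer, while the image is handled by a separate visual encoder.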
VT3: Visual-Tabular Data-to-Text Transformer
We introduce three pre-training objectives:
Benchmarking on WikiLandmarks
VT3 outperforms both SoTA table-to-text and image captioning methods
Effect of Visual Encoders
Swin yields better results than other ViTs and than an object-centric vision encoder (FRCNN)
Name | Capitol Park Historic District |
Image | Capital Park Detroit MI |
Caption | Capitol Park, from the north |
Location | Detroit, Michigan, U.S. |
Coordinates | 46.64611°’N 7.6525°’W |
Built | 1877 |
Architect | Albert Kahn Associates et al. |
Architecture | Italianate, Romanesque Revival |
Added | March 18, 1999 |
Designated | Michigan State Historic Site |
Result
VT3 (w/o vision):
Capitol Park is a park in Downtown Detroit, Michigan, United States.
VT3:
Capitol Park Historic District is a commercial historic district located in downtown Detroit, Michigan.
Ground Truth:
The Capitol Park Historic District is a historic district located in downtown Detroit, Michigan.
Thank You
Visit our project page for Code and Dataset
https://vl2g.github.io/projects/vistot/
Supported by Microsoft Academic Partnership Grant 2022
Appendix