Table 2: Ablation for VT3 model with different Visual Encoders.
Table 1: Performance comparison on the WikiLandmarks test set.
We thank Microsoft for supporting this work through the Microsoft Academic Partnership Grant (MAPG) 2021.
Looking at the Table-to-Text problem from a multimodal lens
VisToT Task
WikiLandmarks Dataset
VT3: Vision-Tabular Data to Text Transformer
Paper, code, and dataset available here: https://vl2g.github.io/projects/vistot/
We propose the task of VisToT, a vision-augmented extension to the table-to-text problem.
We introduce the WikiLandmarks dataset to study the VisToT task.
We present VT3, a multimodal transformer for solving VisToT.
“Lough Leane is a large lake in Killarney, County Kerry, Ireland.”
Tables contain a structured list of facts, while images are a rich source of unstructured visual information.
VisToT proposes using information from both modalities to generate a meaningful text description.
Summary
Experiments
Name | Amitabha Drukpa |
Country | Nepal |
Location | Kathmandu |
Dedicated To | Amitabha |
“Amitabha Monastery is a Tibetan Buddhist Monastery in Nepal”
“Michigan Stadium, nicknamed The Big House, is the football stadium for the University of Michigan in Ann Arbor, Michigan”
Name | Michigan Stadium |
Location | 1201 South Main Street Ann Arbor, Michigan |
Owner | University of Michigan |
Nickname | The Big House |
VisToT: Vision-Augmented Table-to-Text Generation
Prajwal Gatti¹, Anand Mishra¹, Manish Gupta², Mithun Das Gupta²
¹Indian Institute of Technology Jodhpur, ²Microsoft
[Figure: VT3 architecture. Labeled components: Swin Transformer (input: Image); Pretrained-BART Encoder (input: “[NAME] Lough Leane [Location] Killar…”); Pretrained-BART Decoder (input: “[bos] Lough Leane is a large”, output: “Lough Leane is a large lake…”).]
We also propose three pre-training objectives:
[Figure: Attention visualization during text generation, shown for the generated tokens “Par”, “-liament”, “Gardens”, “are”, “park”, …, “Building”, ….]
Given a table T describing an entity E and an associated image I, the goal is to generate a sentence S that accurately describes E using the context provided by T and I.
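The inputs and output in this definition can be sketched as a small data structure. The snippet below is only an illustration, not the authors' code: the `VisToTExample` container, the `linearize_table` helper, and the image file name are hypothetical, and the uppercased `[KEY] value` format is an assumption loosely modeled on the linearized table shown in the VT3 figure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class VisToTExample:
    """One VisToT sample (hypothetical container): table T, image I, sentence S."""
    table: Dict[str, str]  # T: attribute -> value, kept in display order
    image_path: str        # I: path to the associated image
    sentence: str          # S: reference description of the entity E


def linearize_table(table: Dict[str, str]) -> str:
    """Flatten a table into '[KEY] value' segments (assumed format)."""
    return " ".join(f"[{key.upper()}] {value}" for key, value in table.items())


example = VisToTExample(
    table={
        "Name": "Amitabha Drukpa",
        "Country": "Nepal",
        "Location": "Kathmandu",
        "Dedicated To": "Amitabha",
    },
    image_path="amitabha_monastery.jpg",  # hypothetical file name
    sentence="Amitabha Monastery is a Tibetan Buddhist Monastery in Nepal",
)

linearized = linearize_table(example.table)
print(linearized)
# [NAME] Amitabha Drukpa [COUNTRY] Nepal [LOCATION] Kathmandu [DEDICATED TO] Amitabha
```

A flat string of this kind is one convenient way to pass a table to a text encoder; the image would be handled separately by a visual encoder.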
VisToT is applicable in domains such as tourism, healthcare, and e-commerce.