
  • Tables and images for 73K+ world landmarks.
  • Each sample contains a table, an image, and a text summary.
  • Tables and text summaries are obtained from Wikipedia.
  • Images contain visually inferable facts:
    • Type of landmark (e.g., Church, Castle),
    • Architecture (e.g., Ancient Roman, Mughal),
    • Composition (e.g., White Marble, Bronze), and many more.

Table 2: Ablation for VT3 model with different Visual Encoders.

Table 1: Performance comparison on the WikiLandmarks test set.

We thank Microsoft for supporting this work through the Microsoft Academic Partnership Grant (MAPG) 2021.

Looking at the Table-to-Text problem from a multimodal lens

VisToT Task

WikiLandmarks Dataset

VT3: Vision-Tabular Data to Text Transformer

Paper, code, and dataset available here: https://vl2g.github.io/projects/vistot/

We propose the task of VisToT, a vision-augmented extension to the table-to-text problem.

We introduce the WikiLandmarks dataset to study the VisToT task.

We present VT3, a multimodal transformer for solving VisToT.

“Lough Leane is a large lake in Killarney, County Kerry, Ireland.”

Tables contain a structured list of facts, while images are a rich source of unstructured visual information.

VisToT uses information from both modalities to generate a meaningful text description.

Summary

Experiments

Name: Amitabha Drukpa
Country: Nepal
Location: Kathmandu
Dedicated To: Amitabha

“Amitabha Monastery is a Tibetan Buddhist Monastery in Nepal”

“Michigan Stadium, nicknamed The Big House, is the football stadium for the University of Michigan in Ann Arbor, Michigan”

Name: Michigan Stadium
Location: 1201 South Main Street, Ann Arbor, Michigan
Owner: University of Michigan
Nickname: The Big House

VisToT: Vision-Augmented Table-to-Text Generation

Prajwal Gatti¹, Anand Mishra¹, Manish Gupta², Mithun Das Gupta²

¹Indian Institute of Technology Jodhpur, ²Microsoft

[Figure: VT3 architecture. Components: Swin Transformer (input: Image), Pretrained-BART Encoder (input: “[NAME] Lough Leane [Location] Killar…”), Pretrained-BART Decoder (input: “[bos] Lough Leane is a large”; output: “Lough Leane is a large lake…”).]

We also propose three pre-training objectives:

  1. Image-Table Matching (ITabM),
  2. Masked Value Modeling (MVM), and
  3. Image Captioning (IC).
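
The poster names the objectives but does not describe how they are implemented. As an illustration only, the sketch below shows one plausible way to build a Masked Value Modeling training example: hide a single cell value and keep the original value as the target to recover. The helper `mask_one_value`, the single-cell masking strategy, and the use of `<mask>` as a placeholder are all assumptions, not details taken from the paper.

```python
import random

# Placeholder for the hidden value (assumption; the poster does not
# say which mask token VT3 uses).
MASK = "<mask>"


def mask_one_value(table, rng):
    """Hide one randomly chosen cell value (hypothetical helper).

    Returns a corrupted copy of the table and the (field, value)
    pair that a model would be trained to recover.
    """
    field = rng.choice(list(table))
    corrupted = dict(table)  # copy, so the input table is not mutated
    corrupted[field] = MASK
    return corrupted, (field, table[field])


# Table taken from the Amitabha Monastery example on the poster.
table = {
    "Name": "Amitabha Drukpa",
    "Country": "Nepal",
    "Location": "Kathmandu",
    "Dedicated To": "Amitabha",
}
corrupted, (field, target) = mask_one_value(table, random.Random(0))
```

Under this assumed setup, the image can help recover values that the remaining cells do not determine, which is one way to read the motivation for pairing the objective with visual input.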

[Figure: Attention Visualization during text generation. Token labels: “Par”, “-liament”, “Gardens”, “are”, “park”, “Building”.]

Given a table T describing an entity E and an associated image I, the goal is to generate a sentence description S such that it accurately describes E using the source context of T and I.
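
To make the (T, I, S) setup concrete, here is a minimal sketch of one sample and of a table linearization in the bracketed-field style that appears in the architecture figure. The class `VisToTSample`, the function `linearize_table`, the image file name, and the exact tag casing and separators are hypothetical; the poster shows only a truncated linearized string.

```python
from dataclasses import dataclass


@dataclass
class VisToTSample:
    """One (T, I, S) triple; all field names are hypothetical."""
    table: dict       # T: attribute name -> value
    image_path: str   # I: path to the landmark image (made-up file name below)
    summary: str      # S: reference sentence describing the entity


def linearize_table(table):
    """Flatten a table into '[Field] value' segments, in insertion order."""
    return " ".join(f"[{key}] {value}" for key, value in table.items())


# Table and summary taken from the Michigan Stadium example on the poster.
sample = VisToTSample(
    table={
        "Name": "Michigan Stadium",
        "Location": "1201 South Main Street, Ann Arbor, Michigan",
        "Owner": "University of Michigan",
        "Nickname": "The Big House",
    },
    image_path="michigan_stadium.jpg",
    summary=(
        "Michigan Stadium, nicknamed The Big House, is the football stadium "
        "for the University of Michigan in Ann Arbor, Michigan"
    ),
)
linearized = linearize_table(sample.table)
```

Here `linearized` is the text-side input a sequence-to-sequence model would consume, while the image would be encoded separately by a visual encoder.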

VisToT can be applied in domains such as tourism, healthcare, and e-commerce.