Reasoning over Text in Images for VQA

Commons Project Description



In this work, we address the TextVQA problem. Unlike the traditional VQA task, which mostly focuses on objects and often ignores text in the image, TextVQA explicitly requires reading and understanding text in images to derive an answer. For example, to answer “what is the speed limit” for the picture on the right, a TextVQA model needs to understand the question, look at the traffic sign, and read the number 75 on the sign to give “75 mph” as the answer.

We propose the M4C model to address these challenges. We handle three input modalities (the question, visual objects, and OCR tokens) by embedding them into a joint representation space and applying a multimodal transformer over it. Compared to previous work, our M4C model achieves over 25 percent relative improvement on three benchmark datasets for the TextVQA task.
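The idea of projecting each modality into one shared space and then attending over the combined sequence can be illustrated with a minimal sketch. This is not the M4C implementation: the feature sizes, the random projection matrices, and the helper names (`linear`, `self_attention`, `random_matrix`) are assumptions chosen for illustration, and a real transformer stacks several multi-head layers with learned query, key, and value projections.

```python
import math
import random

D = 8  # shared embedding dimension (assumed for illustration)


def linear(x, w):
    """Project vector x with weight matrix w (len(x) rows, D columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention (identity projections)."""
    d = len(seq[0])
    out = []
    for q in seq:
        # Every entity attends to every entity, regardless of modality.
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        )
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Hypothetical raw features: each modality has its own feature size.
question_feats = [[rng.random() for _ in range(6)] for _ in range(4)]  # 4 words
object_feats = [[rng.random() for _ in range(10)] for _ in range(3)]   # 3 objects
ocr_feats = [[rng.random() for _ in range(5)] for _ in range(2)]       # 2 OCR tokens

# One modality-specific projection per input type, all mapping to dimension D.
w_question = random_matrix(6, D, rng)
w_object = random_matrix(10, D, rng)
w_ocr = random_matrix(5, D, rng)

# Joint sequence: all entities now live in the same D-dimensional space.
joint = (
    [linear(x, w_question) for x in question_feats]
    + [linear(x, w_object) for x in object_feats]
    + [linear(x, w_ocr) for x in ocr_feats]
)

fused = self_attention(joint)
print(len(fused), len(fused[0]))
```

Because every entity sits in the same space after projection, a single attention operation covers both intra-modality pairs (word to word) and inter-modality pairs (word to OCR token) without a dedicated fusion module for each pair.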



Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a "deep water" label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task, which requires reading and understanding text in images to answer a question. However, existing approaches for TextVQA are mostly based on custom pairwise fusion mechanisms between two modalities and are restricted to a single prediction step by casting TextVQA as a classification task.

An overview of our proposed M4C model

In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intra-modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.
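Multi-step decoding with a pointer over OCR tokens can be sketched with a toy example. Everything here is a hypothetical stand-in rather than the authors' code: the hand-picked vectors, the two-word vocabulary, and `decoder_step`, which replaces the transformer with a simple rotation so the run is easy to trace. Scoring uses a plain dot product as a simplification.

```python
# Toy sketch: at each step, score fixed-vocabulary words and OCR tokens
# together, pick the best one, and feed its vector back as the next input.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


VOCAB = ["<end>", "mph"]                     # hypothetical fixed vocabulary
VOCAB_EMB = [[0, 0, 0, 1], [0, 0, 1, 0]]     # hand-picked toy embeddings


def decoder_step(prev_input):
    """Stand-in for the transformer: rotate the vector by one position."""
    return prev_input[-1:] + prev_input[:-1]


def decode(ocr_tokens, ocr_repr, begin_emb, max_steps=5):
    answer, prev = [], begin_emb
    for _ in range(max_steps):
        state = decoder_step(prev)
        vocab_scores = [dot(state, e) for e in VOCAB_EMB]
        # "Dynamic" pointer: one score per OCR token found in this image,
        # so the candidate set changes from image to image.
        ocr_scores = [dot(state, r) for r in ocr_repr]
        scores = vocab_scores + ocr_scores
        best = max(range(len(scores)), key=scores.__getitem__)
        if best < len(VOCAB):
            token, prev = VOCAB[best], VOCAB_EMB[best]
        else:
            n = best - len(VOCAB)
            token, prev = ocr_tokens[n], ocr_repr[n]  # copy from the image
        if token == "<end>":
            break
        answer.append(token)
    return answer


ocr_tokens = ["75", "SPEED", "LIMIT"]
ocr_repr = [[0, 1, 0, 0], [0.3, 0.2, 0, 0], [0.3, 0, 0.2, 0]]
answer = decode(ocr_tokens, ocr_repr, begin_emb=[1, 0, 0, 0])
print(" ".join(answer))  # prints "75 mph"
```

The answer mixes a token copied from the image ("75") with a word from the fixed vocabulary ("mph"), which a single-step classifier over a fixed answer set could produce only if that exact string were among its classes.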

Our results suggest that it is efficient to handle multiple modalities through domain-specific embedding followed by homogeneous self-attention, and to generate complex answers through multi-step decoding instead of one-step classification.

Performance of our model on three benchmark datasets for the TextVQA task

In summary, our contributions in this work are as follows: