Article

Text-instance graph: Exploring the relational semantics for text-based visual question answering

Journal

PATTERN RECOGNITION
Volume 124

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2021.108455

Keywords

Text-based visual question answering; Spatial overlapping; Text-Instance graph; Copy mechanism

Funding

  1. Tianshu AI Platform
  2. Zhejiang Lab
  3. National Key Research and Development Program of China [2018AAA0102200]
  4. National Natural Science Foundation of China [61772116, 61872064, 62020106008]
  5. Sichuan Science and Technology Program [2019JDTD0005]

This study addresses the TextVQA problem and proposes a novel Text-Instance Graph (TIG) network for it. TIG models overlapping relationships between visual object instances and OCR texts by building an OCR-OBJ graph, and introduces a dynamic OCR-OBJ graph network to handle questions with complex logic. Experimental results on three benchmarks demonstrate the effectiveness of the proposed method compared with existing approaches.
It is time to stop neglecting the text around your world. In VQA, the surrounding text helps humans to understand complete visual scenes and reason about question semantics efficiently. Here, we address the challenging Text-based Visual Question Answering (TextVQA) problem, which requires a model to answer VQA questions with text reading ability. Existing TextVQA methods mainly focus on the latent relationships between detected object instances and scene texts with the given question, but ignore spatial location relationships and complex relational semantics between visual object instances and OCR texts (e.g. the A of B on C). To deal with these challenges, we propose a novel Text-Instance Graph (TIG) network for TextVQA. The TIG builds an OCR-OBJ graph to model overlapping relationships, where each node of the graph is updated by utilizing related objects or OCR texts. To deal with questions with complex logic, we propose a dynamic OCR-OBJ graph network to extend the perception space of graph nodes, which grasps the information of non-directly adjacent node features. Considering a scene about the brand of the computer on the table, the model would build correlations between brand and table using the computer node as the intermediate node. Extensive experiments on three benchmarks demonstrate the effectiveness and superiority of the proposed method. In addition, our TIG achieves 0.505 ANLS on the ST-VQA challenge leaderboard and sets a new state-of-the-art. (c) 2021 Published by Elsevier Ltd.
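
The abstract's "brand of the computer on the table" example can be illustrated with a small sketch. This is not the paper's architecture: the abstract does not specify how nodes are updated, so the code below only builds a graph whose edges join bounding boxes that spatially overlap, then extends each node's neighbourhood to nodes reachable through an intermediate node. The box coordinates, node names, and the helper functions `boxes_overlap`, `overlap_adjacency`, and `extend_hops` are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical pixel coordinates


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def overlap_adjacency(boxes: List[Box]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix: 1 where two boxes overlap."""
    n = len(boxes)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if boxes_overlap(boxes[i], boxes[j]):
                adj[i][j] = adj[j][i] = 1
    return adj


def extend_hops(adj: List[List[int]], hops: int) -> List[List[int]]:
    """Mark distinct node pairs joined by a path of at most `hops` edges."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for _ in range(hops - 1):
        nxt = [row[:] for row in reach]
        for i in range(n):
            for k in range(n):
                if reach[i][k]:
                    for j in range(n):
                        if adj[k][j] and i != j:
                            nxt[i][j] = 1
        reach = nxt
    return reach


# Illustrative scene: an OCR "brand" token printed on a computer that sits on a table.
nodes = ["table", "computer", "brand"]
boxes = [
    (0, 50, 100, 100),  # table (object)
    (30, 30, 70, 60),   # computer (object), overlaps the table
    (40, 35, 60, 45),   # brand (OCR text), overlaps the computer only
]

adj = overlap_adjacency(boxes)
two_hop = extend_hops(adj, hops=2)
```

In this toy scene the brand text and the table do not overlap, so `adj` has no edge between them, while `two_hop` links them through the computer node. A learned model would attach feature vectors to the nodes and aggregate them along such edges; that part is left out here because the abstract gives no details on it.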
