4.7 Article

A multimodal attention fusion network with a dynamic vocabulary for TextVQA

Journal

PATTERN RECOGNITION
Volume 122

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2021.108214

Keywords

Dynamic vocabulary; Attention map; Multimodal fusion; ST-VQA

Funding

  1. National Key Research and Development Program of China [2020AAA0107900]

Abstract

Visual question answering (VQA) is a well-known problem in computer vision. Recently, text-based VQA tasks have received increasing attention because text information is important for image understanding. The key to this task is making good use of the text that appears in the image. In this work, we propose an attention-based encoder-decoder network that combines multimodal information from visual, linguistic, and location features. By using the attention mechanism to focus on the features most relevant to the question, our multimodal feature fusion provides more accurate information and improves performance. Furthermore, we present a decoder with an attention-map loss, which can not only predict complex answers but also operate over a dynamic vocabulary that reduces the decoding space. Compared with a softmax-based cross-entropy loss, which can only handle a fixed-size vocabulary, the attention-map loss significantly improves both accuracy and efficiency. Our method achieved first place in all three tasks of the ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering (ST-VQA). (c) 2021 Elsevier Ltd. All rights reserved.
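
The PyTorch-style sketch below illustrates the two ideas the abstract describes: question-guided attention fusion of visual, linguistic, and location features, and a pointer-style attention-map loss over a dynamic per-image vocabulary. All module names, feature dimensions, and the exact scoring function are illustrative assumptions for this sketch, not the authors' actual implementation.

# Minimal sketch of (1) attention-based fusion of visual, linguistic, and
# location features and (2) an attention-map loss that "points" into a
# dynamic vocabulary of candidate tokens. Dimensions and module structure
# are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalAttentionFusion(nn.Module):
    """Question-guided attention over per-region visual, linguistic (OCR text),
    and location (bounding-box) features, producing a fused representation."""

    def __init__(self, d_vis=2048, d_txt=300, d_loc=4, d_model=512):
        super().__init__()
        self.proj = nn.Linear(d_vis + d_txt + d_loc, d_model)
        self.attn = nn.Linear(2 * d_model, 1)  # scores each region against the question

    def forward(self, vis, txt, loc, q):
        # vis: (B, N, d_vis), txt: (B, N, d_txt), loc: (B, N, d_loc), q: (B, d_model)
        regions = torch.tanh(self.proj(torch.cat([vis, txt, loc], dim=-1)))   # (B, N, d)
        q_exp = q.unsqueeze(1).expand_as(regions)
        scores = self.attn(torch.cat([regions, q_exp], dim=-1)).squeeze(-1)   # (B, N)
        alpha = F.softmax(scores, dim=-1)                                     # attention map
        fused = torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)             # (B, d)
        return fused, alpha


def attention_map_loss(alpha, target_index):
    """Cross-entropy applied to the attention map itself: the decoder selects
    one of the N candidate tokens detected in the image, so the output space
    is per-image (dynamic) rather than a fixed global word list."""
    # alpha: (B, N) attention over candidate tokens, target_index: (B,) long
    return F.nll_loss(torch.log(alpha + 1e-9), target_index)


if __name__ == "__main__":
    B, N = 2, 30  # batch size, number of candidate regions/tokens per image
    fusion = MultimodalAttentionFusion()
    vis = torch.randn(B, N, 2048)
    txt = torch.randn(B, N, 300)
    loc = torch.randn(B, N, 4)
    q = torch.randn(B, 512)
    fused, alpha = fusion(vis, txt, loc, q)
    loss = attention_map_loss(alpha, torch.randint(0, N, (B,)))

Because the answer token is chosen by pointing into the N candidates found in each image, the decoding space grows with the image's OCR output rather than with a fixed global vocabulary, which is the accuracy and efficiency argument the abstract makes for the attention-map loss.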
