Proceedings Paper

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/CVPR46437.2021.00864


Funding

  1. NSF [IIS-1704337, IIS-1813709]

Abstract

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to conventional vision-language pre-training that fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) during pre-training. With three pre-training tasks, including masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), pre-training with scene text effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. Due to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline. To further improve the performance, we build a large-scale scene-text-related image-text dataset based on the Conceptual Captions dataset, named OCR-CC, which contains 1.4 million images with scene text. Pre-trained on this OCR-CC dataset, our approach outperforms the state of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
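The masked language modeling task described above operates over a joint sequence of question words, detected object labels, and OCR (scene-text) tokens. The following is a minimal sketch of that kind of three-modality input masking; the separator tokens, the 15% masking rate, and the function names are common BERT-style conventions assumed here for illustration, not the paper's exact configuration.

```python
import random

MASK = "[MASK]"

def build_sequence(question, obj_labels, ocr_tokens):
    """Concatenate the three modalities (text words, visual object
    labels, scene-text tokens) into one token sequence."""
    return (["[CLS]"] + question + ["[SEP]"]
            + obj_labels + ["[SEP]"]
            + ocr_tokens + ["[SEP]"])

def mask_tokens(tokens, rate=0.15, rng=None):
    """Randomly replace non-special tokens with [MASK].

    Returns the masked sequence and a parallel target list: the
    original token where a prediction is required, else None.
    """
    rng = rng or random.Random(0)
    specials = {"[CLS]", "[SEP]"}
    masked, targets = [], []
    for tok in tokens:
        if tok not in specials and rng.random() < rate:
            masked.append(MASK)
            targets.append(tok)   # model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # not a prediction target
    return masked, targets

seq = build_sequence(["what", "does", "the", "sign", "say"],
                     ["sign", "pole"],   # hypothetical object labels
                     ["stop"])           # hypothetical OCR token
masked, targets = mask_tokens(seq)
```

Because the OCR tokens sit in the same sequence as the question words and object labels, every masked prediction can attend across all three modalities, which is the mechanism the abstract credits for the aligned representation.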

