Layer-wise enhanced transformer with multi-modal fusion for image caption

Article Automation & Control Systems

Sequential Transformer via an Outside-In Attention for image captioning

Yiwei Wei et al.

Summary: This study introduces an Outside-in Attention mechanism to address the limitations of recurrent attention and self-attention in image captioning tasks. By combining the advantages of transformers and recurrent networks, the model achieves competitive results.

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE (2022)

Article Computer Science, Artificial Intelligence

Dual Global Enhanced Transformer for image captioning

Tiantao Xian et al.

Summary: This paper proposes the Dual Global Enhanced Transformer (DGET) to incorporate global information in image captioning. DGET adaptively fuses visual global information using a novel Global Enhanced Encoder (GEE) and explicitly utilizes textual global information with a Global Enhanced Decoder (GED), achieving superior performance on the MS COCO dataset.

NEURAL NETWORKS (2022)

Proceedings Paper Computer Science, Artificial Intelligence

LaTr: Layout-Aware Transformer for Scene-Text VQA

Ali Furkan Biten et al.

Summary: We propose a novel multimodal architecture called LaTr for Scene Text Visual Question Answering (STVQA). We reveal the importance of the language module, especially when it is enriched with layout information. Our single-objective pre-training scheme, which requires only text and spatial cues, improves model performance on scanned documents and enhances robustness to OCR errors. By leveraging a vision transformer and performing vocabulary-free decoding, LaTr outperforms existing STVQA methods on multiple datasets.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) (2022)

Article Engineering, Electrical & Electronic