4.6 Article

Layer-wise enhanced transformer with multi-modal fusion for image caption

Related references

Note: Only some of the references are listed here; download the original article for the complete reference information.

Article Automation & Control Systems

Sequential Transformer via an Outside-In Attention for image captioning

Yiwei Wei et al.

Summary: This study introduces an Outside-In Attention mechanism to address the limitations of recurrent attention and self-attention in image captioning. By combining the advantages of transformers and recurrent networks, the model achieves competitive results.

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE (2022)

Article Computer Science, Artificial Intelligence

Dual Global Enhanced Transformer for image captioning

Tiantao Xian et al.

Summary: This paper proposes Dual Global Enhanced Transformer (DGET) to incorporate global information in image captioning. DGET adaptively fuses visual global information using a novel Global Enhanced Encoder (GEE) and explicitly utilizes textual global information with a Global Enhanced Decoder (GED), achieving superior performance on the MS COCO dataset.

NEURAL NETWORKS (2022)

Proceedings Paper Computer Science, Artificial Intelligence

LaTr: Layout-Aware Transformer for Scene-Text VQA

Ali Furkan Biten et al.

Summary: We propose a novel multimodal architecture called LaTr for Scene Text Visual Question Answering (STVQA). The importance of the language module, especially when enriched with layout information, is revealed. Our single objective pre-training scheme that only requires text and spatial cues improves model performance on scanned documents and enhances robustness towards OCR errors. By leveraging a vision transformer and performing vocabulary-free decoding, LaTr outperforms existing STVQA methods on multiple datasets.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) (2022)

Article Engineering, Electrical & Electronic

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Jun Yu et al.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2020)

Proceedings Paper Computer Science, Artificial Intelligence

Show, Edit and Tell: A Framework for Editing Image Captions

Fawaz Sammani et al.

2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2020)

Article Chemistry, Multidisciplinary

Boosted Transformer for Image Captioning

Jiangyun Li et al.

APPLIED SCIENCES-BASEL (2019)

Article Chemistry, Multidisciplinary

Captioning Transformer with Stacked Attention Modules

Xinxin Zhu et al.

APPLIED SCIENCES-BASEL (2018)

Article Computer Science, Artificial Intelligence

Deep Visual-Semantic Alignments for Generating Image Descriptions

Andrej Karpathy et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2017)

Article Computer Science, Artificial Intelligence

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2017)