4.6 Article

Contextual and selective attention networks for image captioning

Journal

SCIENCE CHINA-INFORMATION SCIENCES
Volume 65, Issue 12, Pages -

Publisher

SCIENCE PRESS
DOI: 10.1007/s11432-020-3523-6

Keywords

image captioning; hybrid attention; contextual attention

Funding

  1. National Key Research and Development Program of China [2018AAA0102002]
  2. National Natural Science Foundation of China [61732007]

Abstract

This paper introduces a new design that explores the interdependencies between attention histories and emphasizes the focus of each attention step in image captioning. By memorizing contextual attention and extracting the principal components of each attention, the proposed CoSA-Net achieves a clear performance improvement.
The steady momentum of innovations has convincingly demonstrated the high capability of attention mechanisms for sequence-to-sequence learning. Nevertheless, attention is typically computed independently at each step of a sequence, in either hard or soft mode, which can lead to undesired effects such as repeatedly attending to the same content. In this paper, we introduce a new design that holistically explores the interdependencies between attention histories and locally emphasizes the strong focus of each attention step in image captioning. Specifically, we present a contextual and selective attention network (CoSA-Net) that memorizes contextual attention and brings out the principal components of each attention. Technically, CoSA-Net writes/updates the attended image region features into a memory and reads from that memory when measuring attention at the next time step, thereby leveraging contextual knowledge. Only the regions with the top-k highest attention scores are selected, and each selected region feature is individually used to compute an output distribution. The final output is an attention-weighted mixture of all k distributions. In turn, the attention is updated with the posterior distribution conditioned on the output. CoSA-Net is appealing in that it can be plugged into the sentence decoder of any neural captioning model. Extensive experiments on the COCO image captioning dataset demonstrate the superiority of CoSA-Net. More remarkably, integrating CoSA-Net into a one-layer long short-term memory (LSTM) decoder increases the CIDEr-D score from 125.2% to 128.5% on the COCO Karpathy test split. When a two-layer LSTM decoder is further endowed with CoSA-Net, the CIDEr-D score is boosted to 129.5%.
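To make the mechanism described in the abstract concrete, the following PyTorch-style sketch illustrates one decoding step of contextual and selective attention. It is not the authors' implementation: the module name, dimensions, the gated memory update, and the choice of k are assumptions for illustration, and the posterior-based attention upgrade is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualSelectiveAttention(nn.Module):
    """One decoding step: memory-aware (contextual) attention plus a top-k selective mixture.
    All names and the memory update rule are hypothetical choices for this sketch."""

    def __init__(self, region_dim, hidden_dim, vocab_size, top_k=5):
        super().__init__()
        self.top_k = top_k
        self.query_proj = nn.Linear(hidden_dim, region_dim)      # map decoder state to a query
        self.mem_gate = nn.Linear(region_dim * 2, region_dim)    # assumed gated memory update
        self.out_proj = nn.Linear(region_dim + hidden_dim, vocab_size)

    def forward(self, regions, hidden, memory):
        # regions: (B, R, D) image region features
        # hidden:  (B, H) decoder hidden state at the current time step
        # memory:  (B, D) summary of previously attended regions (contextual attention)

        # Read from memory: condition the attention query on contextual knowledge.
        query = self.query_proj(hidden) + memory                  # (B, D)
        scores = torch.einsum('brd,bd->br', regions, query)       # (B, R)
        attn = F.softmax(scores, dim=-1)

        # Selective attention: keep only the top-k most attended regions.
        top_w, top_idx = attn.topk(self.top_k, dim=-1)            # (B, k)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)           # renormalize over k
        idx = top_idx.unsqueeze(-1).expand(-1, -1, regions.size(-1))
        top_regions = regions.gather(1, idx)                      # (B, k, D)

        # One output distribution per selected region, then an attention-weighted mixture.
        h = hidden.unsqueeze(1).expand(-1, self.top_k, -1)        # (B, k, H)
        per_region_probs = F.softmax(self.out_proj(torch.cat([top_regions, h], dim=-1)), dim=-1)
        mixture = (top_w.unsqueeze(-1) * per_region_probs).sum(dim=1)  # (B, V)

        # Write/update memory with the newly attended content for the next time step.
        attended = (attn.unsqueeze(-1) * regions).sum(dim=1)      # (B, D)
        gate = torch.sigmoid(self.mem_gate(torch.cat([memory, attended], dim=-1)))
        new_memory = gate * memory + (1.0 - gate) * attended
        return mixture, new_memory

In such a sketch, a captioning decoder would call this module once per generated word and carry new_memory forward, with the mixture replacing (or augmenting) the decoder's usual vocabulary softmax.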
