Divergent-convergent attention for image captioning

PATTERN RECOGNITION (2021)

Journal

PATTERN RECOGNITION

Volume 115

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.patcog.2021.107928

Keywords

Image Captioning; Divergent Observation; Convergent Attention

Funding

National Natural Science Foundation of China [61906007, 61672065]
Beijing Municipal Science and Technology Project [KM202110005022]

Automated Summary

A novel divergent-convergent attention (DCA) model is proposed to address the issues in current attention-based image captioning methods. By utilizing multi-perspective inputs and adaptive attention merging, the model achieves more precise focus on local image regions and generates more descriptive sentences. The interaction between visual and semantic components contributes to the model's superior performance on the MS COCO dataset.

Abstract

Attention mechanisms have driven great progress in image captioning by selectively embedding semantic words or local regions into the language model. However, current attention-based image captioning methods ignore fine-grained semantic information and its interaction with visual regions. Inspired by how humans describe an image, through divergent observation followed by convergent attention, we propose a novel divergent-convergent attention (DCA) model to address these problems. In our DCA model, divergent observation is reflected mainly in the multi-perspective inputs: a visual collection obtained from object detection and three semantic components of the scene graph, consisting of objects, attributes, and relations respectively. Convergent attention then merges these multi-perspective inputs through a hierarchical structure, adaptively deciding which perspective is crucial and which element within the focused perspective dominates the attention process. Our model also exploits the interaction between visual objects and semantic components so that the two complement each other. Owing to this interaction between the divergent visual and semantic components, and to the gradual convergence of attention, our model can attend to the corresponding local region more precisely under the guidance of the semantic components. In addition, with the assistance of the visual components, the DCA model can effectively use the fine-grained semantic components to generate more descriptive sentences. Experiments on the MS COCO dataset demonstrate the superiority of the proposed method. (c) 2021 Elsevier Ltd. All rights reserved.
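
The hierarchical merging described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the order of the two levels (elements first, then perspectives), the function names, and the toy feature vectors are all assumptions chosen for illustration. The only parts taken from the abstract are the four perspectives (visual regions, objects, attributes, relations) and the idea of weighting both perspectives and the elements inside them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, elements):
    """Single-level attention (assumed dot-product scoring).

    Returns the weighted sum of `elements` and the attention weights.
    """
    weights = softmax([dot(query, e) for e in elements])
    dim = len(elements[0])
    context = [sum(w * e[i] for w, e in zip(weights, elements))
               for i in range(dim)]
    return context, weights


def hierarchical_attention(query, perspectives):
    """Two-level attention over a dict: perspective name -> feature vectors.

    Level 1 summarizes each perspective by attending over its elements.
    Level 2 attends over those summaries to weight the perspectives.
    """
    names = list(perspectives)
    summaries, element_weights = {}, {}
    for name in names:
        summaries[name], element_weights[name] = attend(query, perspectives[name])
    merged, p_weights = attend(query, [summaries[n] for n in names])
    return merged, dict(zip(names, p_weights)), element_weights


# Toy 2-d features; in a real captioner these would come from an object
# detector and a scene-graph parser, and `query` from the decoder state.
query = [1.0, 0.0]
perspectives = {
    "visual": [[1.0, 0.0], [0.0, 1.0]],
    "objects": [[0.5, 0.5]],
    "attributes": [[0.0, 1.0], [0.2, 0.8]],
    "relations": [[1.0, 1.0]],
}
merged, perspective_weights, element_weights = hierarchical_attention(query, perspectives)
print(perspective_weights)
print(element_weights["visual"])
```

The perspective weights indicate which input the hypothetical decoder step relies on most, while the per-perspective element weights indicate which region or scene-graph node dominates within it. Both sets of weights sum to one at their own level.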
