Article

Divergent-convergent attention for image captioning

Journal

PATTERN RECOGNITION
Volume 115

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2021.107928

Keywords

Image Captioning; Divergent Observation; Convergent Attention

Funding

  1. National Natural Science Foundation of China [61906007, 61672065]
  2. Beijing Municipal Science and Technology Project [KM202110005022]

A novel divergent-convergent attention (DCA) model is proposed to address shortcomings of current attention-based image captioning methods, which overlook fine-grained semantic information and its interaction with visual regions. By combining multi-perspective inputs with adaptive attention merging, the model focuses more precisely on local image regions and generates more descriptive sentences. The interaction between visual and semantic components contributes to the model's superior performance on the MS COCO dataset.

Attention mechanisms have made great progress in image captioning, where semantic words or local regions are selectively embedded into the language model. However, current attention-based image captioning methods ignore fine-grained semantic information and its interaction with visual regions. Inspired by how humans describe an image, through divergent observation followed by convergent attention, we propose a novel divergent-convergent attention (DCA) model to tackle the problems of current attention models in image captioning. In our DCA model, divergent observation is mainly reflected in the multi-perspective inputs: a visual collection obtained from object detection, and three semantic components of a scene graph, consisting of objects, attributes, and relations respectively. Convergent attention then merges these multi-perspective inputs through a hierarchical structure, adaptively deciding which perspective is crucial and which element within the focused perspective dominates the attention process. Our model also exploits the interaction between visual objects and semantic components so that the two complement each other. Owing to this interaction between divergent visual and semantic components, and to the gradual convergence of attention, our model can attend to the corresponding local region more precisely under the guidance of the semantic components. In addition, with the assistance of the visual components, the DCA model can effectively use the fine-grained semantic components to generate more descriptive sentences. Experiments on the MS COCO dataset demonstrate the superiority of our proposed method. (c) 2021 Elsevier Ltd. All rights reserved.
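
The hierarchical merging described in the abstract (first weighting elements within each perspective, then weighting the perspectives themselves) can be illustrated with a small sketch. This is not the authors' DCA implementation: it assumes plain dot-product scoring against a single query vector (for example, a decoder state), and the names `attend` and `hierarchical_attention`, as well as the toy feature vectors, are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, vectors):
    """Dot-product attention (an assumed scoring rule, not the paper's).

    Returns the attention weights and the weighted sum of `vectors`.
    """
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    context = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return weights, context


def hierarchical_attention(query, perspectives):
    """Two-level attention over a dict: perspective name -> list of vectors.

    Level 1 picks the dominant elements inside each perspective.
    Level 2 decides how much each perspective contributes overall.
    """
    names = list(perspectives)
    element_weights = {}
    summaries = []
    for name in names:
        weights, summary = attend(query, perspectives[name])
        element_weights[name] = weights
        summaries.append(summary)
    perspective_weights, context = attend(query, summaries)
    return dict(zip(names, perspective_weights)), element_weights, context


if __name__ == "__main__":
    # Toy 2-d features standing in for the four perspectives named in the abstract.
    toy = {
        "visual": [[1.0, 0.0], [0.0, 1.0]],
        "objects": [[0.5, 0.5]],
        "attributes": [[0.0, 1.0], [0.0, 2.0]],
        "relations": [[2.0, 0.0]],
    }
    p_w, e_w, ctx = hierarchical_attention([1.0, 0.0], toy)
    print(p_w)
    print(ctx)
```

In this sketch both levels share the same query and scoring function for brevity; a learned model would typically use separate trainable projections at each level.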
