Journal
MULTIMEDIA TOOLS AND APPLICATIONS
Volume -, Issue -, Pages -
Publisher
SPRINGER
DOI: 10.1007/s11042-023-15869-x
Keywords
Contextual representation; Cross-modal feature fusion; Image captioning; Stacked attention network; Visual and semantic information
This paper proposes a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, which combines high-level semantic concepts and visual information to generate fine-grained captions.
The attention-enriched encoder-decoder framework has recently attracted great interest in image captioning due to its impressive progress. Many visual attention models directly leverage meaningful image regions to generate descriptions. However, a direct transition from visual space to text is not sufficient to generate fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. We therefore propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, in which cross-modal features are consolidated simultaneously through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of the proposed compounding function, where the CAA provides a concise, context-sensitive semantic representation. To better exploit the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information throughout the caption generation process. Experimental results indicate that the proposed SCFC outperforms various state-of-the-art image captioning models in terms of popular metrics on the MSCOCO and Flickr30K benchmarks.
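The abstract only summarizes the SCFC mechanism, so the following PyTorch sketch is a hypothetical illustration of the general idea (attending over visual regions and attribute embeddings, then compounding the two summaries across stacked reasoning steps) rather than the authors' implementation. The additive attention form, the concatenate-and-project compounding function, the residual query update, and all module and variable names (ConsolidationStep, SCFC, regions, caa, and so on) are assumptions; the paper's exact CAA computation and SCFC-LSTM decoder are not reproduced here.

```python
# Minimal sketch of stacked cross-modal feature consolidation,
# assuming additive attention and a concatenate-and-project
# compounding function; NOT the authors' exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsolidationStep(nn.Module):
    """One reasoning step: attend over visual regions and semantic
    attributes with the current query, then compound the summaries."""
    def __init__(self, dim):
        super().__init__()
        self.vis_att = nn.Linear(dim, dim)
        self.sem_att = nn.Linear(dim, dim)
        self.query_proj = nn.Linear(dim, dim)
        # placeholder for the paper's compounding function
        self.compound = nn.Linear(2 * dim, dim)

    def attend(self, proj, feats, query):
        # additive-style attention: score each feature against the query
        scores = torch.tanh(proj(feats) + query.unsqueeze(1)).sum(-1)
        weights = F.softmax(scores, dim=1)
        return (weights.unsqueeze(-1) * feats).sum(1)

    def forward(self, query, visual, attrs):
        q = self.query_proj(query)
        v_ctx = self.attend(self.vis_att, visual, q)  # visual summary
        a_ctx = self.attend(self.sem_att, attrs, q)   # semantic (CAA) summary
        # consolidate the cross-modal summaries and refine the query
        return query + self.compound(torch.cat([v_ctx, a_ctx], -1))

class SCFC(nn.Module):
    """Stack of consolidation steps, i.e. multi-step reasoning."""
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = nn.ModuleList(ConsolidationStep(dim) for _ in range(steps))

    def forward(self, query, visual, attrs):
        for step in self.steps:
            query = step(query, visual, attrs)
        return query  # consolidated feature fed to the caption LSTM

# usage with toy tensors
scfc = SCFC(dim=512)
h = torch.zeros(2, 512)            # e.g. decoder hidden state as query
regions = torch.randn(2, 36, 512)  # visual region features
caa = torch.randn(2, 10, 512)      # context-aware attribute embeddings
fused = scfc(h, regions, caa)      # (2, 512)
```

In this reading, the residual update lets each step refine, rather than replace, the query, which is one simple way to realize the multi-step reasoning the abstract describes; the consolidated feature would then condition the caption generator at each decoding step.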