Article

An attention based dual learning approach for video captioning

Journal

Applied Soft Computing
Volume 117, Article 108332

Publisher

Elsevier
DOI: 10.1016/j.asoc.2021.108332

Keywords

Attention mechanism; Deep neural network; Dual learning; Encoder-decoder; Video captioning

Summary

Video captioning is an important task in multimedia processing, and traditional approaches use only visual information to generate captions. This paper proposes a novel attention based dual learning approach (ADL) that improves the quality of video captions by minimizing the differences between reproduced and raw videos.
Abstract

Video captioning aims to generate sentences/captions that describe video contents. It is one of the key tasks in the field of multimedia processing. However, most current video captioning approaches utilize only the visual information of a video to generate captions. Recently, a new encoder-decoder-reconstructor architecture was developed for video captioning, which can capture the information in both raw videos and the generated captions through dual learning. Based on this architecture, this paper proposes a novel attention based dual learning approach (ADL) for video captioning. Specifically, ADL is composed of a caption generation module and a video reconstruction module. The caption generation module builds a translatable mapping between raw video frames and the generated video captions, i.e., using the visual features extracted from videos by an Inception-V4 network to produce video captions. The video reconstruction module then reproduces the raw video frames from the generated captions, i.e., using the hidden states of the decoder in the caption generation module to reproduce/synthesize the raw visual features. A multi-head attention mechanism helps the two modules focus on the most effective information in videos and captions, and a dual learning mechanism fine-tunes the two modules to generate the final video captions. ADL can therefore minimize the semantic gap between raw videos and the generated captions by minimizing the differences between the reproduced and the raw videos, thereby improving the quality of the generated captions. Experimental results demonstrate that ADL is superior to state-of-the-art video captioning approaches on benchmark datasets. (C) 2021 Published by Elsevier B.V.
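To make the described architecture concrete, below is a minimal sketch of an ADL-style encoder-decoder-reconstructor, written in PyTorch. All class names, layer choices (GRU decoder, mean-pooled reconstruction queries), dimensions, and the loss weight `lam` are illustrative assumptions, not the authors' implementation; only the overall structure follows the abstract: Inception-V4 frame features are decoded into captions, the decoder's hidden states are used to reproduce the raw visual features, multi-head attention appears in both modules, and a combined objective penalizes the gap between reproduced and raw videos.

```python
# A minimal sketch of an ADL-style encoder-decoder-reconstructor, assuming
# PyTorch. Module names, dimensions, and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Caption generation module: maps raw video features to caption tokens."""
    def __init__(self, feat_dim=1536, hidden=512, vocab=10000, heads=8):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden)   # project Inception-V4 features
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, captions):
        # feats: (B, T_frames, feat_dim); captions: (B, T_words) token ids
        enc = self.encode(feats)                    # (B, T_frames, hidden)
        emb = self.embed(captions)                  # (B, T_words, hidden)
        # Multi-head attention lets each word position focus on the most
        # informative frames.
        ctx, _ = self.attn(emb, enc, enc)           # (B, T_words, hidden)
        states, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(states), states             # token logits + decoder states

class VideoReconstructor(nn.Module):
    """Video reconstruction module: reproduces raw visual features from the
    decoder's hidden states."""
    def __init__(self, hidden=512, feat_dim=1536, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.recon = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, dec_states, n_frames):
        # One query per target frame attends over the caption-side states.
        q = dec_states.mean(dim=1, keepdim=True).repeat(1, n_frames, 1)
        ctx, _ = self.attn(q, dec_states, dec_states)
        h, _ = self.recon(ctx)
        return self.out(h)                          # (B, n_frames, feat_dim)

# Dual-learning objective: caption cross-entropy plus a reconstruction term
# that penalizes the gap between reproduced and raw features. The trade-off
# weight `lam` is a hypothetical hyperparameter.
def adl_loss(logits, captions, recon_feats, feats, lam=0.2):
    ce = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict the next token
        captions[:, 1:].reshape(-1))
    rec = nn.functional.mse_loss(recon_feats, feats)  # video reconstruction
    return ce + lam * rec

# Example shapes: 26 frames of 1536-d features, 12-token captions.
gen, rec = CaptionGenerator(), VideoReconstructor()
feats = torch.randn(2, 26, 1536)
captions = torch.randint(0, 10000, (2, 12))
logits, states = gen(feats, captions)
loss = adl_loss(logits, captions, rec(states, feats.size(1)), feats)
```

In this reading, the reconstruction term acts as a regularizer on the decoder: hidden states that discard visually salient content incur a larger feature-reconstruction error, pushing the generator toward captions that preserve more of the video's semantics.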

