Article

Learning multiscale hierarchical attention for video summarization

Journal

PATTERN RECOGNITION
Volume 122

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2021.108312

Keywords

Video summarization; Hierarchical structure; Attention models; Multiscale temporal representation; Two-stream framework

Funding

  1. National Key Research and Development Program of China [2017YFA0700802]
  2. National Natural Science Foundation of China [61822603, U1813218, U1713214, 61672306]
  3. Shenzhen Fundamental Research Fund (Subject Arrangement) [JCYJ20170412170602564]
  4. Shuimu Tsinghua Scholar Program [2021SM012]


This paper proposes a multiscale hierarchical attention approach for supervised video summarization, which leverages both the short-range and long-range temporal representations via intra-block and inter-block attention. The method integrates frame-level, block-level, and video-level representations to predict importance scores, conducts shot segmentation, computes shot-level scores, and performs key shot selection for producing video summaries. Furthermore, the two-stream framework incorporating appearance and motion information enhances the effectiveness of the method, as validated on the SumMe and TVSum datasets.
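The block-wise attention scheme summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pooling choices (mean pooling of blocks, mean pooling for the video-level vector), the sigmoid scoring head, and the zero-padding of the last block are illustrative assumptions, and the random weight vector stands in for a learned prediction layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x of shape (n, d)."""
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))
    return attn @ x

def hierarchical_scores(frames, block_len):
    """Toy multiscale pipeline: frames (T, d) -> per-frame scores (T,)."""
    T, d = frames.shape
    pad = (-T) % block_len                         # zero-pad so T divides evenly (assumption)
    x = np.concatenate([frames, np.zeros((pad, d))]) if pad else frames
    blocks = x.reshape(-1, block_len, d)           # (B, block_len, d) equal-length blocks
    # intra-block attention: short-range context within each block
    local = np.stack([self_attention(b) for b in blocks])
    # inter-block attention: long-range context over block summaries
    block_repr = local.mean(axis=1)                # (B, d), mean pooling is an assumption
    global_repr = self_attention(block_repr)       # (B, d)
    video_repr = global_repr.mean(axis=0)          # (d,) video-level representation
    # fuse frame-, block-, and video-level representations for each frame
    B = len(blocks)
    fused = np.concatenate([
        local.reshape(-1, d),
        np.repeat(global_repr, block_len, axis=0),
        np.tile(video_repr, (B * block_len, 1)),
    ], axis=1)[:T]                                 # (T, 3d), padding frames dropped
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=3 * d)          # stand-in for a learned scoring head
    return 1 / (1 + np.exp(-(fused @ w)))          # sigmoid importance scores in (0, 1)
```

In the paper the attention layers and the scoring head are trained with supervision from human-annotated importance scores; the sketch only shows how the three representation levels combine per frame.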
In this paper, we propose a multiscale hierarchical attention approach for supervised video summarization. Unlike most existing supervised methods, which employ bidirectional long short-term memory networks, our method exploits the underlying hierarchical structure of video sequences and learns both short-range and long-range temporal representations via intra-block and inter-block attention. Specifically, we first separate each video sequence into blocks of equal length and employ the intra-block and inter-block attention to learn local and global information, respectively. Then, we integrate the frame-level, block-level, and video-level representations for frame-level importance score prediction. Next, we conduct shot segmentation and compute shot-level importance scores. Finally, we perform key shot selection to produce video summaries. Moreover, we extend our method into a two-stream framework, where appearance and motion information are leveraged. Experimental results on the SumMe and TVSum datasets validate the effectiveness of our method against state-of-the-art methods. (c) 2021 Elsevier Ltd. All rights reserved.
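The final step, selecting key shots from shot-level importance scores under a summary-length budget, is commonly cast as a 0/1 knapsack problem in the SumMe/TVSum evaluation protocol. The abstract does not give the authors' exact selection procedure, so the following is a generic dynamic-programming sketch: shot scores and lengths are assumed given, and the budget (e.g. 15% of the video's frames) is a parameter.

```python
def select_key_shots(shot_scores, shot_lengths, budget):
    """0/1 knapsack: choose shots maximizing total score within a frame budget.

    shot_scores  -- list of shot-level importance scores
    shot_lengths -- list of shot lengths in frames (same order)
    budget       -- maximum total length of the summary in frames
    Returns the sorted indices of the selected shots.
    """
    n = len(shot_scores)
    # dp[i][c] = best total score using the first i shots within capacity c
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = shot_lengths[i - 1], shot_scores[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]                    # skip shot i-1
            if length <= c:                            # or take it if it fits
                dp[i][c] = max(dp[i][c], dp[i - 1][c - length] + score)
    # backtrack to recover which shots were taken
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= shot_lengths[i - 1]
    return sorted(chosen)
```

For example, with scores [1.0, 4.0, 3.0], lengths [3, 4, 2], and a budget of 6 frames, the selector picks shots 1 and 2 (total score 7.0, total length 6).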
