Article

Learning multiscale hierarchical attention for video summarization

Journal

PATTERN RECOGNITION
Volume 122

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2021.108312

Keywords

Video summarization; Hierarchical structure; Attention models; Multiscale temporal representation; Two-stream framework

Funding

  1. National Key Research and Development Program of China [2017YFA0700802]
  2. National Natural Science Foundation of China [61822603, U1813218, U1713214, 61672306]
  3. Shenzhen Fundamental Research Fund (Subject Arrangement) [JCYJ20170412170602564]
  4. Shuimu Tsinghua Scholar Program [2021SM012]


This paper proposes a multiscale hierarchical attention approach for supervised video summarization, which leverages both the short-range and long-range temporal representations via intra-block and inter-block attention. The method integrates frame-level, block-level, and video-level representations to predict importance scores, conducts shot segmentation, computes shot-level scores, and performs key shot selection for producing video summaries. Furthermore, the two-stream framework incorporating appearance and motion information enhances the effectiveness of the method, as validated on the SumMe and TVSum datasets.
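The block-wise attention scheme summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pooling choices (mean pooling of blocks, mean pooling for the video-level vector), the sigmoid scoring head, and the zero-padding of the last block are illustrative assumptions, and the random weight vector stands in for a learned prediction layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x of shape (n, d)."""
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))
    return attn @ x

def hierarchical_scores(frames, block_len):
    """Toy multiscale pipeline: frames (T, d) -> per-frame scores (T,)."""
    T, d = frames.shape
    pad = (-T) % block_len                         # zero-pad so T divides evenly (assumption)
    x = np.concatenate([frames, np.zeros((pad, d))]) if pad else frames
    blocks = x.reshape(-1, block_len, d)           # (B, block_len, d) equal-length blocks
    # intra-block attention: short-range context within each block
    local = np.stack([self_attention(b) for b in blocks])
    # inter-block attention: long-range context over block summaries
    block_repr = local.mean(axis=1)                # (B, d), mean pooling is an assumption
    global_repr = self_attention(block_repr)       # (B, d)
    video_repr = global_repr.mean(axis=0)          # (d,) video-level representation
    # fuse frame-, block-, and video-level representations for each frame
    B = len(blocks)
    fused = np.concatenate([
        local.reshape(-1, d),
        np.repeat(global_repr, block_len, axis=0),
        np.tile(video_repr, (B * block_len, 1)),
    ], axis=1)[:T]                                 # (T, 3d), padding frames dropped
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=3 * d)          # stand-in for a learned scoring head
    return 1 / (1 + np.exp(-(fused @ w)))          # sigmoid importance scores in (0, 1)
```

In the paper the attention layers and the scoring head are trained with supervision from human-annotated importance scores; the sketch only shows how the three representation levels combine per frame.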
In this paper, we propose a multiscale hierarchical attention approach for supervised video summarization. Unlike most existing supervised methods, which employ bidirectional long short-term memory networks, our method exploits the underlying hierarchical structure of video sequences and learns both short-range and long-range temporal representations via intra-block and inter-block attention. Specifically, we first separate each video sequence into blocks of equal length and employ the intra-block and inter-block attention to learn local and global information, respectively. Then, we integrate the frame-level, block-level, and video-level representations for frame-level importance score prediction. Next, we conduct shot segmentation and compute shot-level importance scores. Finally, we perform key shot selection to produce video summaries. Moreover, we extend our method into a two-stream framework, where appearance and motion information are leveraged. Experimental results on the SumMe and TVSum datasets validate the effectiveness of our method against state-of-the-art methods. (c) 2021 Elsevier Ltd. All rights reserved.
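The final step, selecting key shots from shot-level importance scores under a summary-length budget, is commonly cast as a 0/1 knapsack problem in the SumMe/TVSum evaluation protocol. The abstract does not give the authors' exact selection procedure, so the following is a generic dynamic-programming sketch: shot scores and lengths are assumed given, and the budget (e.g. 15% of the video's frames) is a parameter.

```python
def select_key_shots(shot_scores, shot_lengths, budget):
    """0/1 knapsack: choose shots maximizing total score within a frame budget.

    shot_scores  -- list of shot-level importance scores
    shot_lengths -- list of shot lengths in frames (same order)
    budget       -- maximum total length of the summary in frames
    Returns the sorted indices of the selected shots.
    """
    n = len(shot_scores)
    # dp[i][c] = best total score using the first i shots within capacity c
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = shot_lengths[i - 1], shot_scores[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]                    # skip shot i-1
            if length <= c:                            # or take it if it fits
                dp[i][c] = max(dp[i][c], dp[i - 1][c - length] + score)
    # backtrack to recover which shots were taken
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= shot_lengths[i - 1]
    return sorted(chosen)
```

For example, with scores [1.0, 4.0, 3.0], lengths [3, 4, 2], and a budget of 6 frames, the selector picks shots 1 and 2 (total score 7.0, total length 6).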
