Article

Video representation learning for temporal action detection using global-local attention

Journal

PATTERN RECOGNITION
Volume 134

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.patcog.2022.109135

Keywords

Temporal action detection; Video representation; Untrimmed video analysis

Funding

  1. National Natural Science Foundation of China
  2. Key Research and Development Program in the Shaanxi Province of China
  3. Natural Science Basic Research Program of Shaanxi
  4. [61976167]
  5. [U19B2030]
  6. [62101416]
  7. [2021GY082]
  8. [2022JQ-708]


Video representation is crucial for temporal action detection, with different requirements for action classification and action localization. This paper proposes a Global-Local Attention (GLA) mechanism to produce a powerful video representation without additional parameters. GLA enhances the discriminability and localization ability of video representation through global and local attention mechanisms, achieving state-of-the-art performance.
Video representation is of significant importance for temporal action detection. The two sub-tasks of temporal action detection, i.e., action classification and action localization, have different requirements for video representation. Specifically, action classification requires video representations to be highly discriminative, so that action features and background features are as dissimilar as possible. For action localization, it is crucial to obtain information about the action itself and the surrounding context for accurate prediction of action boundaries. However, the previous methods failed to extract the optimal representations for the two sub-tasks, whose representations for both sub-tasks are obtained in a similar way. In this paper, a Global-Local Attention (GLA) mechanism is proposed to produce a more powerful video representation for temporal action detection without introducing additional parameters. The global attention mechanism predicts each action category by integrating features in the entire video that are similar to the action while suppressing other features, thus enhancing the discriminability of video representation during the training process. The local attention mechanism uses a Gaussian weighting function to integrate each action and its surrounding contextual information, thereby enabling precise localization of the action. The effectiveness of GLA is demonstrated on THUMOS'14 and ActivityNet-1.3 with a simple one-stage action detection network, achieving state-of-the-art performance among the methods using only RGB images as input. The inference speed of the proposed model reaches 1373 FPS on a single Nvidia Titan Xp GPU. The generalizability of GLA to other detection architectures is verified using R-C3D and Decouple-SSAD, both of which achieve consistent improvements.
The experimental results demonstrate that designing representations with different properties for the two sub-tasks leads to better performance for temporal action detection compared to the representations obtained in a similar way. © 2022 Elsevier Ltd. All rights reserved.
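The abstract's local attention mechanism integrates each action with its surrounding context via a Gaussian weighting function. A minimal sketch of that idea is shown below; the function name, the `center`/`sigma` parameters, and the use of plain NumPy are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gaussian_local_attention(features, center, sigma):
    """Aggregate per-snippet video features with a Gaussian weight
    centred on an action instance, so the result blends the action
    with its nearby temporal context (illustrative sketch only).

    features: (T, D) array of temporal snippet features.
    center:   temporal position of the action centre.
    sigma:    width controlling how much surrounding context is mixed in.
    """
    t = np.arange(features.shape[0], dtype=np.float64)
    # Gaussian weights peak at the action centre and decay with distance.
    w = np.exp(-0.5 * ((t - center) / sigma) ** 2)
    w /= w.sum()              # normalise so the weights form a distribution
    return w @ features       # (D,) context-aware representation

# toy usage: 10 snippets with 4-d features, action centred at t = 4
feats = np.random.rand(10, 4)
rep = gaussian_local_attention(feats, center=4.0, sigma=1.5)
```

Because the normalised weights form a convex combination, the pooled vector stays within the range of the input features while emphasising snippets near the action boundary region, which is the property the abstract attributes to the local branch.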

