Article

EAN: Event Adaptive Network for Enhanced Action Recognition

Journal

INTERNATIONAL JOURNAL OF COMPUTER VISION
Volume 130, Issue 10, Pages 2453-2471

Publisher

SPRINGER
DOI: 10.1007/s11263-022-01661-1

Keywords

Action recognition; Dynamic neural networks; Vision transformers; Motion representation

Funding

  1. NSFC [U19B2035, 61831015]
  2. National Key R&D Program of China [2021YFE0206700]
  3. Shanghai Municipal Science and Technology Major Project [2021SHZDZX0102]
  4. CAAI-Huawei MindSpore Open Fund

Abstract

Efficiently modeling spatial-temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales and thus struggle with events of various scales. On the other hand, the dense interaction modeling paradigm achieves only sub-optimal performance because action-irrelevant parts introduce additional noise into the final prediction. In this paper, we propose a unified action recognition framework that investigates the dynamic nature of video content through the following designs. First, when extracting local cues, we generate spatial-temporal kernels of dynamic scale to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we mine the interactions only among a few selected foreground objects with a Transformer, which yields a sparse paradigm. We call the proposed framework the Event Adaptive Network (EAN) because both key designs are adaptive to the input video content. To exploit short-term motions within local segments, we further propose a novel and efficient Latent Motion Code module, which improves the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-Something V1 & V2, Kinetics, and Diving48, verify that our models achieve state-of-the-art or competitive performance at low FLOPs. Code is available at: https://github.com/tianyuan168326/EAN-Pytorch.
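To make the two adaptive designs in the abstract concrete, the sketch below illustrates (a) mixing spatial-temporal convolutions of several scales with input-conditioned weights and (b) restricting Transformer interactions to a few high-scoring "foreground" tokens. This is a minimal illustration only, not the authors' implementation; all module names, channel sizes, kernel scales, and the number of selected tokens are assumptions made for the example, and the official EAN-Pytorch repository should be consulted for the actual design.

# Minimal sketch (not the authors' code) of the two adaptive ideas.
import torch
import torch.nn as nn


class DynamicScaleConv(nn.Module):
    """Mix 3D convolutions of several scales with input-conditioned weights."""

    def __init__(self, channels: int = 64, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, k, padding=k // 2) for k in scales
        )
        self.gate = nn.Sequential(              # predicts one weight per scale
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, len(scales)), nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        w = self.gate(x)                                        # (B, S)
        outs = torch.stack([b(x) for b in self.branches], 1)    # (B, S, C, T, H, W)
        return (w[:, :, None, None, None, None] * outs).sum(1)  # adaptive mixture


class SparseForegroundTransformer(nn.Module):
    """Model interactions only among the k highest-scoring foreground tokens."""

    def __init__(self, dim: int = 64, k: int = 8, heads: int = 4):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)           # per-token foreground score
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -> tokens (B, N, C), N = T * H * W
        tokens = x.flatten(2).transpose(1, 2)
        idx = self.score(tokens).squeeze(-1).topk(self.k, dim=1).indices
        selected = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return self.encoder(selected).mean(dim=1)               # (B, C) video descriptor


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 14, 14)          # toy clip features
    local = DynamicScaleConv()(clip)              # dynamic-scale local cues
    video = SparseForegroundTransformer()(local)  # sparse global aggregation
    print(video.shape)                            # torch.Size([2, 64])

In this toy pipeline, the gating weights and the token selection both depend on the input clip, which is the sense in which the framework is "event adaptive"; the Latent Motion Code module mentioned in the abstract is not sketched here.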

