Article

SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 24, Pages 313-322

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TMM.2021.3050058

Keywords

Video recognition; scene; object; feature fusion; semantics attention

Funding

  1. National Key R&D Program of China [2018YFB1004300]

Abstract

Video recognition aims at understanding semantic contents that normally involve the interactions of humans and related objects under certain scenes. A common practice to improve recognition accuracy is to combine object, scene and action features directly for classification, assuming that they are explicitly complementary. In this paper, we break down the fusion of the three features into two pairwise feature relation modeling processes, which mitigates the difficulty of correlation learning in high dimensional features. Towards this goal, we introduce a Semantics Attention Module (SAM) that captures the relations of a pair of features by refining the relatively weak feature with guidance from the strong feature using attention mechanisms. The refined representation is further combined with the strong feature through a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) to improve video recognition. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet v1.3; the proposed approach achieves better results while requiring much less computation than alternative methods.
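To make the pairwise design concrete, the sketch below shows how an attention-guided refinement of a weak feature by a strong feature, followed by a residual fusion, could look in PyTorch. This is a minimal illustration under stated assumptions: the layer sizes, the choice of scaled dot-product attention, and all names (SemanticsAttentionModule, strong_dim, weak_dim, and so on) are hypothetical and are not the authors' exact architecture.

```python
# Minimal sketch of a SAM-style module as described in the abstract: the
# weaker feature is refined under the guidance of the stronger feature via
# attention, and the refined result is fused with the strong feature through
# a residual connection. All dimensions, names, and the dot-product attention
# form are illustrative assumptions, not the published design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsAttentionModule(nn.Module):
    def __init__(self, strong_dim, weak_dim, hidden_dim=256):
        super().__init__()
        # Project both features into a shared space for attention.
        self.query = nn.Linear(strong_dim, hidden_dim)  # guidance from the strong feature
        self.key = nn.Linear(weak_dim, hidden_dim)
        self.value = nn.Linear(weak_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, strong_dim)    # map the refined feature back

    def forward(self, strong, weak):
        # strong: (batch, strong_dim); weak: (batch, n_tokens, weak_dim)
        q = self.query(strong).unsqueeze(1)             # (batch, 1, hidden)
        k = self.key(weak)                              # (batch, n, hidden)
        v = self.value(weak)                            # (batch, n, hidden)
        # Scaled dot-product attention: the strong feature selects which
        # parts of the weak feature are semantically relevant.
        attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        refined = self.out((attn @ v).squeeze(1))       # (batch, strong_dim)
        # Residual design: the refined weak feature augments the strong one.
        return strong + refined

# Usage (hypothetical dimensions): refine per-object features with guidance
# from the clip-level action feature, yielding a fused representation that a
# classifier head could consume. A second SAM would handle the scene feature.
if __name__ == "__main__":
    sam = SemanticsAttentionModule(strong_dim=512, weak_dim=128)
    action = torch.randn(4, 512)       # "strong" action feature per clip
    objects = torch.randn(4, 10, 128)  # "weak" per-object features
    fused = sam(action, objects)
    print(fused.shape)                 # torch.Size([4, 512])
```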
