Article

SAM: Modeling Scene, Object and Action With Semantics Attention Modules for Video Recognition

IEEE TRANSACTIONS ON MULTIMEDIA (2022)

Journal

IEEE TRANSACTIONS ON MULTIMEDIA

Volume 24, Issue -, Pages 313-322

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TMM.2021.3050058

Keywords

Video recognition; scene; object; feature fusion; semantics attention

Categories

Computer Science, Information Systems; Computer Science, Software Engineering; Telecommunications

Funding

National Key R&D Program of China [2018YFB1004300]

AI Summary

Video recognition aims to understand the semantic contents involving interactions between humans and related objects in specific scenes. The fusion of object, scene, and action features is commonly used to improve recognition accuracy. In this paper, the authors propose a method that breaks down the fusion of three features into two pairwise feature relation modeling processes, which helps overcome the challenge of correlation learning in high-dimensional features. The proposed method achieves better results with less computational effort compared to alternative methods.

Abstract

Video recognition aims at understanding semantic contents that normally involve the interactions of humans and related objects under certain scenes. A common practice to improve recognition accuracy is to combine object, scene and action features for classification directly, assuming that they are explicitly complementary. In this paper, we break down the fusion of three features into two pairwise feature relation modeling processes, which mitigates the difficulty of correlation learning in high-dimensional features. Towards this goal, we introduce a Semantics Attention Module (SAM) that captures the relations of a pair of features by refining the relatively weak feature with the guidance from the strong feature using attention mechanisms. The refined representation is further combined with the strong feature using a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) for improving video recognition. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet v1.3. The proposed approach achieves better results while requiring much less computational effort than alternative methods.
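The abstract does not specify the internals of the module, so the following is only a minimal sketch of the general idea it describes: a strong feature guides attention over a set of weak feature vectors, and the refined result is added back to the strong feature through a residual connection. The function names, the use of plain scaled dot-product attention without learned projections, the toy inputs, and the order in which the two pairwise steps are chained are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantics_attention(strong, weak_items):
    """Refine a set of weak feature vectors under guidance of one strong vector.

    Assumption: scaled dot-product attention with no learned projections.
    The abstract does not state the actual formulation used in the paper.
    """
    dim = len(strong)
    # Score each weak vector by its similarity to the strong (guiding) vector.
    scores = [
        sum(s * w for s, w in zip(strong, item)) / math.sqrt(dim)
        for item in weak_items
    ]
    weights = softmax(scores)
    # The attention-weighted sum is the refined weak representation.
    refined = [
        sum(a * item[k] for a, item in zip(weights, weak_items))
        for k in range(dim)
    ]
    # Residual design: add the refined representation to the strong feature.
    fused = [s + r for s, r in zip(strong, refined)]
    return fused, weights


# Toy inputs, made up for illustration only.
action = [1.0, 0.0, 0.0, 0.0]
object_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
scene_feats = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# Hypothetical pairing order: two pairwise steps instead of one three-way fusion.
step1, w1 = semantics_attention(action, object_feats)
step2, w2 = semantics_attention(step1, scene_feats)
```

In this sketch each call handles only one pair of features, which mirrors the abstract's point that two pairwise relation-modeling steps replace a single fusion of all three features.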
