4.7 Article

Frame Augmented Alternating Attention Network for Video Question Answering

Journal

IEEE Transactions on Multimedia
Volume 22, Issue 4, Pages 1032-1041

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TMM.2019.2935678

Keywords

Feature extraction; Visualization; Knowledge discovery; Task analysis; Data mining; Neural networks; Semantics; Video QA; alternating attention; augmented features; neural network

Funding

  1. National Key Research and Development Program of China [SQ2018AAA010010]
  2. National Natural Science Foundation of China [61751209, U1611461, 51605428]
  3. Hikvision-Zhejiang University Joint Research Center
  4. Zhejiang University-Tongdun Technology Joint Laboratory of Artificial Intelligence
  5. Chinese Knowledge Center of Engineering Science and Technology (CKCEST)
  6. Engineering Research Center of Digital Library
  7. Ministry of Education
  8. Zhejiang University iFLYTEK Joint Research Center

Abstract

Vision and language understanding is one of the most fundamental and challenging problems in multimedia intelligence. Simultaneously understanding video actions together with a related natural language question, and then producing an accurate answer, is even more challenging because it requires jointly modeling information across modalities. In the past few years, several studies have begun to attack this problem with attention-enhanced deep neural networks. However, simple attention mechanisms such as unidirectional attention fail to yield a satisfactory mapping between modalities. Moreover, none of these Video QA models exploit high-level semantics at the augmented video-frame level. In this paper, we augment each frame representation with its context information through a novel feature extractor that combines the advantages of ResNet and a variant of C3D. In addition, we propose a novel alternating attention network that can alternately attend to frame regions, video frames, and words in the question over multiple turns. This yields better joint representations of the video and the question, which further helps the model discover the deeper relationship between the two modalities. Our method outperforms state-of-the-art Video QA models on two existing video question answering datasets, and ablation studies show that the feature extractor and the alternating attention mechanism jointly improve performance.
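The abstract only outlines the two components, so the following PyTorch snippet is a minimal sketch of the alternating attention idea rather than the authors' implementation: the module names, feature dimensions, the mean-pooled initialisation, and the restriction to two streams (frames and question words, omitting frame regions) are all assumptions introduced here for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    # Attend over a set of feature vectors, conditioned on a guidance vector.
    def __init__(self, feat_dim, guide_dim, hidden_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.guide_proj = nn.Linear(guide_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, guide):
        # feats: (batch, n, feat_dim), guide: (batch, guide_dim)
        h = torch.tanh(self.feat_proj(feats) + self.guide_proj(guide).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)        # (batch, n)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)     # (batch, feat_dim)

class AlternatingAttention(nn.Module):
    # Alternately attend over question words and (context-augmented) frame
    # features for a fixed number of turns, refining both summaries each turn.
    def __init__(self, frame_dim, word_dim, hidden_dim, n_turns=2):
        super().__init__()
        self.n_turns = n_turns
        self.attend_words = SoftAttention(word_dim, frame_dim, hidden_dim)
        self.attend_frames = SoftAttention(frame_dim, word_dim, hidden_dim)

    def forward(self, frame_feats, word_feats):
        # frame_feats: (batch, T, frame_dim), word_feats: (batch, L, word_dim)
        v = frame_feats.mean(dim=1)   # unconditioned video summary to start
        q = word_feats.mean(dim=1)    # unconditioned question summary to start
        for _ in range(self.n_turns):
            q = self.attend_words(word_feats, v)     # video-guided question summary
            v = self.attend_frames(frame_feats, q)   # question-guided video summary
        return v, q

# Example with assumed sizes: 2048-d frame features (e.g. from ResNet) and
# 300-d word embeddings; the sizes and batch shapes are illustrative only.
att = AlternatingAttention(frame_dim=2048, word_dim=300, hidden_dim=512, n_turns=2)
video_summary, question_summary = att(torch.randn(8, 20, 2048), torch.randn(8, 12, 300))

In the paper, the alternation additionally covers frame regions, and the frame features are first augmented with context by the ResNet/C3D-based extractor; both steps are omitted above for brevity.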

