4.7 Article

Frame Augmented Alternating Attention Network for Video Question Answering

Journal

IEEE Transactions on Multimedia
Volume 22, Issue 4, Pages 1032-1041

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TMM.2019.2935678

Keywords

Feature extraction; Visualization; Knowledge discovery; Task analysis; Data mining; Neural networks; Semantics; Video QA; alternating attention; augmented features; neural network

Funding

  1. National Key Research and Development Program of China [SQ2018AAA010010]
  2. National Natural Science Foundation of China [61751209, U1611461, 51605428]
  3. Hikvision-Zhejiang University Joint Research Center
  4. Zhejiang University-Tongdun Technology Joint Laboratory of Artificial Intelligence
  5. Chinese Knowledge Center of Engineering Science and Technology (CKCEST)
  6. Engineering Research Center of Digital Library
  7. Ministry of Education
  8. Zhejiang University iFLYTEK Joint Research Center

Abstract

Vision and language understanding is one of the most fundamental and challenging problems in multimedia intelligence. Simultaneously understanding video actions together with a related natural language question, and then producing an accurate answer, is even more challenging because it requires jointly modeling information across modalities. In the past few years, several studies have begun to attack this problem with attention-enhanced deep neural networks. However, simple attention mechanisms such as unidirectional attention fail to yield a satisfactory mapping between modalities. Moreover, none of these Video QA models exploit high-level semantics at the augmented video-frame level. In this paper, we augment each frame representation with its context information through a novel feature extractor that combines the advantages of ResNet and a variant of C3D. In addition, we propose a novel alternating attention network that can alternately attend to frame regions, video frames, and words in the question over multiple turns. This yields better joint representations of the video and the question, which further helps the model discover the deeper relationship between the two modalities. Our method outperforms state-of-the-art Video QA models on two existing video question answering datasets, and ablation studies show that the feature extractor and the alternating attention mechanism jointly improve performance.
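The abstract only outlines the two components, so the following PyTorch snippet is a minimal sketch of the alternating attention idea rather than the authors' implementation: the module names, feature dimensions, the mean-pooled initialisation, and the restriction to two streams (frames and question words, omitting frame regions) are all assumptions introduced here for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    # Attend over a set of feature vectors, conditioned on a guidance vector.
    def __init__(self, feat_dim, guide_dim, hidden_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.guide_proj = nn.Linear(guide_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, guide):
        # feats: (batch, n, feat_dim), guide: (batch, guide_dim)
        h = torch.tanh(self.feat_proj(feats) + self.guide_proj(guide).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)        # (batch, n)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)     # (batch, feat_dim)

class AlternatingAttention(nn.Module):
    # Alternately attend over question words and (context-augmented) frame
    # features for a fixed number of turns, refining both summaries each turn.
    def __init__(self, frame_dim, word_dim, hidden_dim, n_turns=2):
        super().__init__()
        self.n_turns = n_turns
        self.attend_words = SoftAttention(word_dim, frame_dim, hidden_dim)
        self.attend_frames = SoftAttention(frame_dim, word_dim, hidden_dim)

    def forward(self, frame_feats, word_feats):
        # frame_feats: (batch, T, frame_dim), word_feats: (batch, L, word_dim)
        v = frame_feats.mean(dim=1)   # unconditioned video summary to start
        q = word_feats.mean(dim=1)    # unconditioned question summary to start
        for _ in range(self.n_turns):
            q = self.attend_words(word_feats, v)     # video-guided question summary
            v = self.attend_frames(frame_feats, q)   # question-guided video summary
        return v, q

# Example with assumed sizes: 2048-d frame features (e.g. from ResNet) and
# 300-d word embeddings; the sizes and batch shapes are illustrative only.
att = AlternatingAttention(frame_dim=2048, word_dim=300, hidden_dim=512, n_turns=2)
video_summary, question_summary = att(torch.randn(8, 20, 2048), torch.randn(8, 12, 300))

In the paper, the alternation additionally covers frame regions, and the frame features are first augmented with context by the ResNet/C3D-based extractor; both steps are omitted above for brevity.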

