4.7 Article

Adaptive Spatial Location With Balanced Loss for Video Captioning

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSVT.2020.3045735

Keywords

Task analysis; Redundancy; Feature extraction; Visualization; Detectors; Computer vision; Training; Convolutional neural network; recurrent neural network; video captioning; adaptive spatial location; balanced loss

Funding

  1. National Natural Science Foundation of China [61525206, 61871004]
  2. 242 Project [2020A077]
  3. National Natural Science Foundation of China (NSFC)-General Technology Fundamental Research Joint Fund [U1836215]
  4. China Postdoctoral Science Foundation [2020TQ0055]
  5. Foundation of Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education

This paper proposes an adaptive spatial location module for video captioning that reduces spatial and temporal redundancy in videos and improves the accuracy of the generated descriptions. It also introduces a balanced loss function that addresses class imbalance and yields more diverse description sentences. Extensive experiments show that the proposed method achieves competitive performance compared to state-of-the-art methods.

Many pioneering approaches have verified the effectiveness of using global temporal and local object information for video understanding tasks and have achieved significant progress. However, existing methods use object detectors to extract all objects over all video frames, which may degrade performance because the information is redundant both spatially and temporally. To address this problem, we propose an adaptive spatial location module for the video captioning task that dynamically predicts an important position in each video frame while generating the description sentence. The proposed method not only makes our model focus on local object information, but also reduces the time and memory consumption caused by temporal redundancy across many video frames and improves the accuracy of the generated descriptions. In addition, we propose a balanced loss function to address the class imbalance problem in the training data. The balanced loss assigns a different weight to each word of the ground-truth sentence during training, which helps the model generate more diverse description sentences. Extensive experimental results on the MSVD and MSR-VTT datasets show that the proposed method achieves competitive performance compared to state-of-the-art methods.
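
The abstract says only that the balanced loss assigns a different weight to each ground-truth word; it does not give the formula. The sketch below is one generic way to realize that idea, not the paper's actual loss: it assumes weights that grow as a word becomes rarer in the training captions, and the names `word_weights`, `weighted_nll`, and the exponent `alpha` are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def word_weights(captions, alpha=0.5):
    """Give rarer words larger weights (assumed scheme, not from the paper).

    The most frequent word gets weight 1.0; a word seen n times gets
    (max_count / n) ** alpha, so alpha=0 disables re-weighting.
    """
    counts = Counter(word for caption in captions for word in caption)
    max_count = max(counts.values())
    return {word: (max_count / n) ** alpha for word, n in counts.items()}


def weighted_nll(token_probs, caption, weights):
    """Weighted negative log-likelihood of one ground-truth caption.

    token_probs[t] is the probability the model assigned to the
    ground-truth word caption[t]. The sum is normalized by the total
    weight so captions of different lengths stay comparable.
    """
    total_weight = sum(weights[word] for word in caption)
    loss = -sum(weights[word] * math.log(p)
                for word, p in zip(caption, token_probs))
    return loss / total_weight


# Toy training captions, already tokenized.
captions = [
    ["a", "man", "is", "cooking"],
    ["a", "man", "is", "singing"],
    ["a", "dog", "is", "running"],
]
weights = word_weights(captions)

# Missing the rare word "cooking" costs more than missing the common word "a".
miss_rare = weighted_nll([0.9, 0.9, 0.9, 0.1], captions[0], weights)
miss_common = weighted_nll([0.1, 0.9, 0.9, 0.9], captions[0], weights)
```

Under this assumed weighting, errors on frequent function words are penalized less than errors on rare content words, which is the general mechanism by which a class-balanced loss can push a captioner away from generic sentences.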
