Article

Parallel Dense Video Caption Generation with Multi-Modal Features

Journal

MATHEMATICS
Volume 11, Issue 17

Publisher

MDPI
DOI: 10.3390/math11173685

Keywords

dense video caption; video captioning; multimodal feature fusion; feature extraction; neural network


Abstract

The task of dense video captioning is to generate detailed natural-language descriptions for an original video, which requires deep analysis and mining of the video's semantics to identify the events it contains. Existing methods typically follow a localisation-then-captioning sequence within given frame sequences, so the generated captions depend heavily on which objects have been detected. This work proposes a parallel dense video captioning method that simultaneously addresses the mutual constraint between event proposals and captions. A deformable Transformer framework is introduced to reduce, or remove, the manual thresholding of hyperparameters required by such methods. An information transfer station is also added as a representation organiser; it receives the hidden features extracted from frames and implicitly generates multiple event proposals. The proposed method adopts an LSTM (long short-term memory) network with deformable attention as the main layer for caption generation. Experimental results show that the proposed method outperforms other methods in this area to a certain degree, providing competitive results on the ActivityNet Captions dataset.
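To make the caption head described above more concrete, the sketch below is a hypothetical PyTorch implementation, not the authors' code, of an LSTM decoder whose per-step context comes from a deformable-style attention over frame features: the decoder state predicts sampling offsets around a reference position inside a proposed event and aggregates the features sampled there. All names (DeformableFrameAttention, LSTMCaptionDecoder, num_points, ref_pos) and shapes are illustrative assumptions; the paper's information transfer station and multi-modal feature fusion are not modelled here.

```python
# Hypothetical sketch (assumed architecture, not the authors' implementation):
# an LSTM caption decoder with single-head deformable-style attention over
# a 1-D sequence of frame features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableFrameAttention(nn.Module):
    """Query predicts sampling offsets and weights around a reference frame."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points)   # sampling offsets (in frames)
        self.weight_proj = nn.Linear(dim, num_points)   # per-point attention weights
        self.value_proj = nn.Linear(dim, dim)

    def forward(self, query, frame_feats, ref_pos):
        # query:       (B, dim)        decoder hidden state
        # frame_feats: (B, T, dim)     per-frame features
        # ref_pos:     (B,) in [0, 1]  normalised reference location of the event
        B, T, _ = frame_feats.shape
        values = self.value_proj(frame_feats)                     # (B, T, dim)
        offsets = self.offset_proj(query)                         # (B, P)
        weights = F.softmax(self.weight_proj(query), dim=-1)      # (B, P)
        # Sampling locations = reference position + predicted offsets, clamped to the clip.
        loc = (ref_pos.unsqueeze(-1) * (T - 1) + offsets).clamp(0, T - 1)
        idx = loc.round().long()                                  # (B, P)
        sampled = torch.gather(
            values, 1, idx.unsqueeze(-1).expand(-1, -1, values.size(-1))
        )                                                         # (B, P, dim)
        return (weights.unsqueeze(-1) * sampled).sum(dim=1)       # (B, dim)


class LSTMCaptionDecoder(nn.Module):
    """LSTM decoder that mixes word embeddings with attended frame context."""

    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.attn = DeformableFrameAttention(dim)
        self.lstm = nn.LSTMCell(2 * dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, frame_feats, ref_pos):
        # tokens: (B, L) caption tokens used for teacher forcing
        B, L = tokens.shape
        h = frame_feats.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(L):
            ctx = self.attn(h, frame_feats, ref_pos)              # (B, dim)
            inp = torch.cat([self.embed(tokens[:, t]), ctx], dim=-1)
            h, c = self.lstm(inp, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                         # (B, L, vocab)


if __name__ == "__main__":
    decoder = LSTMCaptionDecoder(vocab_size=1000)
    feats = torch.randn(2, 64, 512)          # 2 clips, 64 frames, 512-d features
    toks = torch.randint(0, 1000, (2, 12))   # 12-token captions
    ref = torch.rand(2)                      # event centres proposed in [0, 1]
    print(decoder(toks, feats, ref).shape)   # torch.Size([2, 12, 1000])
```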
