Journal
JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES
Volume 35, Issue 1, Pages 257-269
Publisher
ELSEVIER
DOI: 10.1016/j.jksuci.2022.11.015
Keywords
Bengali video captioning; Convolutional neural network; Encoder-decoder model; Recurrent neural network; Attention-mechanism
Abstract
Video captioning is an automated process of captioning a video by understanding the content within it. Although numerous studies have been performed on video captioning in English, the field of video captioning in Bengali remains nearly unexplored. Therefore, this research aims at generating Bengali captions that plausibly describe the gist of a specific video, as well as identifying the best-performing model for Bengali video captioning. To accomplish this, several sequence-to-sequence models (LSTM, BiLSTM, and GRU) are implemented that take video frame features as input, extracted through different CNN models (VGG-19, Inceptionv3, and ResNet50v2), and provide a corresponding textual description as output. Moreover, the attention mechanism is incorporated with these models as a first-ever attempt in Bengali video captioning. In this study, a novel Bengali video captioning dataset is constructed from the Microsoft Research Video Description Corpus (MSVD) dataset (an English video captioning dataset) by utilizing a deep learning-based translator and manual post-editing. Finally, the models' performance is evaluated in terms of the popular evaluation metrics BLEU, METEOR, and ROUGE. The proposed attention-based hybrid model outperforms existing models on these metrics, establishing a new benchmark for Bengali video captioning. (c) 2022 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
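The attention mechanism the abstract describes (scoring per-frame CNN features against the decoder state at each generation step) can be illustrated with a minimal additive (Bahdanau-style) attention sketch. This is not the paper's implementation; all names, dimensions, and the random weights below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(frame_feats, dec_state, Wf, Wd, v):
    """Additive attention: score each frame feature against the decoder state,
    then return the attention-weighted context vector. Shapes are illustrative."""
    scores = np.array([v @ np.tanh(Wf @ f + Wd @ dec_state) for f in frame_feats])
    weights = softmax(scores)        # one weight per video frame, sums to 1
    context = weights @ frame_feats  # weighted sum of CNN frame features
    return context, weights

# Toy setup: 8 frames of 16-d CNN features, a 12-d decoder state (hypothetical sizes).
rng = np.random.default_rng(0)
T, Df, Dd, Da = 8, 16, 12, 10
frames = rng.normal(size=(T, Df))
state = rng.normal(size=Dd)
Wf = rng.normal(size=(Da, Df))
Wd = rng.normal(size=(Da, Dd))
v = rng.normal(size=Da)

ctx, w = additive_attention(frames, state, Wf, Wd, v)
```

In the full model, `ctx` would be concatenated with the decoder input at each step, so the LSTM/BiLSTM/GRU decoder can focus on the frames most relevant to the next Bengali word.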