Article

An attention-based hybrid deep learning approach for Bengali video captioning

Publisher

ELSEVIER
DOI: 10.1016/j.jksuci.2022.11.015

Keywords

Bengali video captioning; Convolutional neural network; Encoder-decoder model; Recurrent neural network; Attention-mechanism


Video captioning is the automated process of generating a caption for a video by understanding the content within it. Although numerous studies have addressed video captioning in English, video captioning in Bengali remains nearly unexplored. This research therefore aims to generate Bengali captions that plausibly describe the gist of a given video and to identify the best-performing model for Bengali video captioning. To accomplish this, several sequence-to-sequence models (LSTM, BiLSTM, and GRU) are implemented that take as input video frame features, extracted through different CNN models (VGG-19, Inceptionv3, and ResNet50v2), and produce a corresponding textual description as output. Moreover, the attention mechanism is incorporated into these models as a first-ever attempt in Bengali video captioning. In this study, a novel Bengali video captioning dataset is constructed from the Microsoft Research Video Description Corpus (MSVD) dataset (an English video captioning dataset) using a deep learning-based translator followed by manual post-editing. Finally, model performance is evaluated in terms of the popular evaluation metrics BLEU, METEOR, and ROUGE. The proposed attention-based hybrid model outperforms existing models on these metrics, establishing a new benchmark for Bengali video captioning. (c) 2022 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
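The core idea the abstract describes is that, at each decoding step, the caption generator attends over the per-frame CNN features and conditions the next word on a weighted summary of them. The following is a minimal sketch of that mechanism, assuming simple dot-product scoring; the paper's actual scoring function, feature dimensions, and variable names are not given in the abstract and are illustrative assumptions here.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(frame_features, decoder_state):
    """Dot-product attention over per-frame CNN feature vectors.

    frame_features: list of T feature vectors (each a list of D floats),
                    e.g. one vector per sampled video frame
    decoder_state:  the decoder's current hidden state (D floats)
    Returns (context, weights): the attention-weighted sum of frame
    features, and the attention distribution over the T frames.
    """
    # Score each frame by its similarity to the decoder state.
    scores = [sum(f * s for f, s in zip(frame, decoder_state))
              for frame in frame_features]
    weights = softmax(scores)
    # Context vector: weighted sum of the frame features.
    d = len(decoder_state)
    context = [sum(w * frame[i] for w, frame in zip(weights, frame_features))
               for i in range(d)]
    return context, weights

# Toy example: 3 frames with 4-dim features (sizes are illustrative only).
frames = [[0.1, 0.2, 0.0, 0.5],
          [0.4, 0.1, 0.3, 0.0],
          [0.2, 0.2, 0.2, 0.2]]
state = [1.0, 0.0, 0.5, 0.5]
context, weights = attention_context(frames, state)
```

In the paper's hybrid setup, the decoder (LSTM, BiLSTM, or GRU) would feed the resulting context vector, together with its hidden state, into the next word-prediction step, so that different frames can dominate different words of the Bengali caption.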

