Article

V2T: video to text framework using a novel automatic shot boundary detection algorithm

Journal

MULTIMEDIA TOOLS AND APPLICATIONS
Volume 81, Issue 13, Pages 17989-18009

Publisher

SPRINGER
DOI: 10.1007/s11042-022-12343-y

Keywords

Shot boundary detection; Illumination; Motion effect; Abrupt transition; Video captioning

Funding

  1. Scheme for Promotion of Academic and Research Collaboration (SPARC) under MHRD, Govt of India [P995, SPARC/2018-2019/119/SL]


This paper presents a dual-stage approach for generating natural language descriptions for videos. The approach addresses the issue of redundancy caused by similar frames in videos by processing selected sets of frames and keyframes. The first stage involves a novel shot boundary detection approach to segment the video and select keyframes and frames. The second stage combines the extracted features with semantic concepts and uses a recurrent neural network for text generation. The proposed approach combines classical and modern computer vision techniques and has been validated on different datasets.
The generation of natural language descriptions for video has been reported by many researchers, yet it remains an active research topic because it sits at the intersection of Computer Vision (CV), Natural Language Processing (NLP) and Deep Learning (DL). Video description results are still not convincing, largely because of the redundancy introduced by the many similar frames in a video. In this paper, we propose a dual-stage text generation approach: the first stage reduces redundancy caused by similar frames by processing selected sets of frames and keyframes from the shots of a video, and in the second stage, the text generator module produces relevant text for the video using the selected sets of frames and keyframes of each shot. In the first stage, a flexible, novel shot boundary detection (SBD, i.e., temporal boundary) approach segments the video into shots, and a keyframe and a set of frames are then selected from each shot using a frame selection policy. Spatio-temporal features for each segment and 2D features for each keyframe are then extracted using a 3D convolutional network and VGG19, respectively. These features are passed to the next stage, where they are embedded with semantic concepts related to the video, and text is generated using a Long Short-Term Memory (LSTM) recurrent network. The proposed approach is an amalgamation of classical and modern computer vision techniques. In the first stage, the Noise-Resistant Local Binary Pattern (NRLBP) feature is used to detect illumination- and motion-invariant temporal boundaries in a video and to select the keyframes and sets of frames used for text generation. The TRECVid 2001 and 2007 datasets are used to validate the accuracy of the proposed SBD approach, and the MSR-VTT (Microsoft Research Video to Text) and YouTube2Text (MSVD) datasets are used to analyze and validate the performance of the proposed video-to-text generation approach.
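
The first stage can be pictured as thresholding the distance between texture signatures of consecutive frames. The sketch below is only illustrative: it substitutes a plain uniform LBP histogram (scikit-image) for the paper's NRLBP descriptor, and the chi-square threshold, neighbourhood size and function names are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(frame, points=8, radius=1):
    """Per-frame texture signature: histogram of uniform LBP codes on the grey image."""
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    lbp = local_binary_pattern(grey, points, radius, method="uniform")
    bins = points + 2  # uniform LBP with P neighbours yields P+2 distinct codes
    hist, _ = np.histogram(lbp, bins=bins, range=(0, bins), density=True)
    return hist

def detect_abrupt_boundaries(video_path, threshold=0.35):
    """Flag frame i as a shot boundary when the chi-square distance between
    the LBP histograms of frames i-1 and i exceeds `threshold` (placeholder value)."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = lbp_histogram(frame)
        if prev_hist is not None:
            # chi-square distance between consecutive texture histograms
            d = 0.5 * np.sum((hist - prev_hist) ** 2 / (hist + prev_hist + 1e-10))
            if d > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```

Because LBP-style codes describe local intensity ordering rather than absolute brightness, a texture-histogram comparison of this kind is less sensitive to illumination changes than raw pixel differencing, which is the motivation the abstract gives for choosing NRLBP.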
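For the second stage, a minimal PyTorch sketch of the described design is shown below: per-shot 3D-CNN features, per-keyframe VGG19 features and a semantic-concept vector are fused to initialise an LSTM decoder that emits word logits. All feature dimensions, layer sizes and class/parameter names here are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ShotCaptioner(nn.Module):
    """Sketch of the second stage: fuse visual and semantic features, decode words with an LSTM."""
    def __init__(self, vocab_size, c3d_dim=4096, vgg_dim=4096,
                 concept_dim=300, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.fuse = nn.Linear(c3d_dim + vgg_dim + concept_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, c3d_feat, vgg_feat, concept_vec, captions):
        # Initialise the LSTM hidden state from the fused visual/semantic features.
        h0 = torch.tanh(self.fuse(torch.cat([c3d_feat, vgg_feat, concept_vec], dim=-1)))
        h0 = h0.unsqueeze(0)            # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)      # (batch, seq_len, embed_dim)
        out, _ = self.lstm(emb, (h0, c0))
        return self.out(out)            # word logits at every time step

# Toy usage with random tensors (batch of 2 shots, 12-word captions).
model = ShotCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 4096), torch.randn(2, 4096),
               torch.randn(2, 300), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```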
