4.7 Article

Parallel Dense Video Caption Generation with Multi-Modal Features

Related references

Note: Only a subset of the references is listed.
Article Computer Science, Artificial Intelligence

Video description: A comprehensive survey of deep learning approaches

Ghazala Rafiq et al.

Summary: Video description is the task of understanding visual content and automatically converting it into textual narration. By combining computer vision and natural language processing, it has practical applications in real-time scenarios, and deep learning-based approaches have shown better results than conventional methods. This paper surveys deep learning-enabled automatic caption generation, focusing on sequence-to-sequence techniques.

ARTIFICIAL INTELLIGENCE REVIEW (2023)

Article Chemistry, Analytical

Fusion of Multi-Modal Features to Enhance Dense Video Caption

Xuefei Huang et al.

Summary: This paper proposes a fusion model that integrates both the visual and audio features of a video for captioning. The model combines the Transformer framework, multi-head attention, and a Common Pool to handle variations in sequence lengths and to filter information, and it uses an LSTM as the decoder to generate description sentences, reducing network memory size (a minimal illustrative sketch of this kind of fusion follows this entry). Experimental results demonstrate the competitiveness of the method on the ActivityNet Captions dataset.

SENSORS (2023)
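A minimal PyTorch sketch of the kind of audio-visual fusion described in the entry above: visual features attend to audio features through multi-head attention, and an LSTM decoder emits the caption tokens. This is not the authors' model; all module names and sizes (AVFusionCaptioner, d_model, vocab_size) are illustrative assumptions.

```python
# Illustrative sketch only: cross-modal fusion with multi-head attention + LSTM decoder.
import torch
import torch.nn as nn

class AVFusionCaptioner(nn.Module):
    def __init__(self, d_model=512, n_heads=8, vocab_size=10000):
        super().__init__()
        # Visual tokens attend to audio tokens (cross-modal multi-head attention).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # An LSTM decoder generates the caption from the fused representation.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual, audio, captions):
        # visual: (B, Tv, d_model), audio: (B, Ta, d_model), captions: (B, L) token ids
        fused, _ = self.cross_attn(query=visual, key=audio, value=audio)
        # Pool the fused sequence into the decoder's initial hidden state.
        h0 = fused.mean(dim=1).unsqueeze(0).contiguous()      # (1, B, d_model)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.out(dec_out)                              # (B, L, vocab_size) logits

logits = AVFusionCaptioner()(torch.randn(2, 30, 512), torch.randn(2, 40, 512),
                             torch.randint(0, 10000, (2, 12)))
```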

Article Computer Science, Information Systems

Hybrid Motion Model for Multiple Object Tracking in Mobile Devices

Yubin Wu et al.

Summary: Traditional intelligent transportation systems track objects only through fixed surveillance cameras; with the advent of the Internet of Things, tracking objects from mobile devices has become more challenging. To address this issue, a hybrid motion model is proposed to improve tracking accuracy on mobile devices.

IEEE INTERNET OF THINGS JOURNAL (2023)

Article Computer Science, Software Engineering

Light field super-resolution using complementary-view feature attention

Wei Zhang et al.

Summary: This paper proposes a novel network, LF-CFANet, to improve light field (LF) super-resolution by dynamically learning the complementary information among LF views. The network uses a residual complementary-view spatial and channel attention module (RCSCAM) to effectively exchange complementary information between views. LF-CFANet outperforms state-of-the-art methods in reconstruction performance and SR accuracy on both synthetic and real-world datasets.

COMPUTATIONAL VISUAL MEDIA (2023)

Article Engineering, Electrical & Electronic

Syntax-Guided Hierarchical Attention Network for Video Captioning

Jincan Deng et al.

Summary: This paper proposes a syntax-guided hierarchical attention network (SHAN) that generates video captions by integrating visual and sentence-context features. Experimental results demonstrate that the proposed method achieves performance comparable to current methods.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Article Computer Science, Artificial Intelligence

Event-centric multi-modal fusion method for dense video captioning

Zhi Chang et al.

Summary: This study proposes an event-centric multi-modal fusion approach for dense video captioning: event proposals are generated from visual-audio cues, event-level representations are enhanced, relationships between events are captured, and multi-modal information is fused through an attention-gating mechanism (a minimal sketch of gated fusion follows this entry). Experimental results demonstrate significant progress on the ActivityNet Captions and YouCook2 datasets.

NEURAL NETWORKS (2022)
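A minimal sketch of gated multi-modal fusion as referenced in the entry above: a learned sigmoid gate decides, per dimension, how much of the visual versus audio event feature to keep. The shapes and names are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch only: sigmoid-gated fusion of event-level visual and audio features.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual_event, audio_event):
        # visual_event, audio_event: (B, dim) event-level features
        g = torch.sigmoid(self.gate(torch.cat([visual_event, audio_event], dim=-1)))
        return g * visual_event + (1.0 - g) * audio_event  # (B, dim) fused feature

fused = GatedFusion()(torch.randn(4, 512), torch.randn(4, 512))
```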

Article Engineering, Electrical & Electronic

Blindly Assess Quality of In-the-Wild Videos via Quality-Aware Pre-Training and Motion Perception

Bowen Li et al.

Summary: Perceptual quality assessment of videos acquired in the wild is crucial for ensuring the quality of video services. This study proposes a model-based transfer learning approach that combines image quality assessment and action recognition to effectively evaluate target video databases.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Article Computer Science, Information Systems

Parallel Pathway Dense Video Captioning With Deformable Transformer

Wangyu Choi et al.

Summary: This paper proposes a parallel-pathway dense video captioning framework that localizes and describes events simultaneously, without depending on the output of preceding modules. By introducing a representation organization network and an event localizer at the branching point of the parallel pathway, and by accounting for sentence fluency and coherence during generation, the proposed method outperforms existing algorithms.

IEEE ACCESS (2022)

Article Computer Science, Artificial Intelligence

Extendable Multiple Nodes Recurrent Tracking Framework With RTU plus

Shuai Wang et al.

Summary: This paper introduces an extendable multiple-node tracking framework, proposes a general recurrent tracking unit for scoring track proposals, and presents a method for generating simulated tracking data. Experimental results show that these methods achieve state-of-the-art performance in multiple-object tracking, and that the recurrent tracking unit also brings significant improvements to other trackers.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2022)

Review Chemistry, Analytical

A Review of Deep Learning-Based Methods for Pedestrian Trajectory Prediction

Bogdan Ilie Sighencea et al.

Summary: Pedestrian trajectory prediction is a key task in areas such as self-driving vehicles and mobile robots, with state-of-the-art methods benefiting from advances in sensor and signal processing technologies. This paper reviews recent deep learning-based solutions to the problem and provides an overview of datasets, performance metrics, and practical applications. It also highlights research gaps and potential directions for future work.

SENSORS (2021)

Article Engineering, Electrical & Electronic

Language-Guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning

Weixia Zhang et al.

Summary: The main challenges of the emerging vision-and-language navigation (VLN) problem arise from the combination of language instructions and visual environments, as well as the discrepancy in action selection between training and inference.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2021)

Proceedings Paper Computer Science, Artificial Intelligence

End-to-End Dense Video Captioning with Parallel Decoding

Teng Wang et al.

Summary: Dense video captioning aims to generate multiple associated captions, together with their temporal locations, from a video. The proposed framework, PDVC, formulates dense caption generation as a set prediction task (a minimal sketch of set-prediction decoding follows this entry) and shows better results than traditional methods in experiments.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)
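A minimal sketch of set-prediction decoding in the spirit of the entry above: a fixed set of learnable event queries attends to encoded frame features in parallel, and each query yields a candidate event segment plus a feature that would feed a caption head. A standard (not deformable) transformer decoder is used here, and all sizes are illustrative assumptions rather than the authors' configuration.

```python
# Illustrative sketch only: parallel set prediction of event slots with learnable queries.
import torch
import torch.nn as nn

class EventSetDecoder(nn.Module):
    def __init__(self, d_model=256, n_queries=10):
        super().__init__()
        # One learnable query per event slot; all slots are decoded in parallel.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.loc_head = nn.Linear(d_model, 2)   # normalized (center, length) per event

    def forward(self, frame_feats):
        # frame_feats: (B, T, d_model) encoded video frames
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        slots = self.decoder(tgt=q, memory=frame_feats)       # (B, N, d_model)
        segments = self.loc_head(slots).sigmoid()             # (B, N, 2) in [0, 1]
        return segments, slots  # slots would feed a caption head

segments, slots = EventSetDecoder()(torch.randn(2, 100, 256))
```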

Proceedings Paper Computer Science, Information Systems

VGGreNet: A Light-Weight VGGNet with Reused Convolutional Set

Ka-Hou Chan et al.

2020 IEEE/ACM 13TH INTERNATIONAL CONFERENCE ON UTILITY AND CLOUD COMPUTING (UCC 2020) (2020)

Article Computer Science, Artificial Intelligence

CAM-RNN: Co-Attention Model Based RNN for Video Captioning

Bin Zhao et al.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

Tanzila Rahman et al.

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Streamlined Dense Video Captioning

Jonghwan Mun et al.

2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Jingwen Wang et al.

2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2018)

Proceedings Paper Computer Science, Artificial Intelligence

Weakly Supervised Dense Video Captioning

Zhiqiang Shen et al.

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Dense-Captioning Events in Videos

Ranjay Krishna et al.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Joao Carreira et al.

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) (2017)

Review Multidisciplinary Sciences

Deep learning

Yann LeCun et al.

NATURE (2015)

Proceedings Paper Computer Science, Artificial Intelligence

Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran et al.

2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2015)

Article Computer Science, Artificial Intelligence

The METEOR metric for automatic evaluation of machine translation

Alon Lavie et al.

MACHINE TRANSLATION (2009)
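METEOR, referenced above and widely used to evaluate video captions, aligns a hypothesis with a reference via exact, stem, and synonym matches and combines unigram precision and recall with a fragmentation penalty. A minimal usage sketch with NLTK's independent implementation follows; it assumes nltk is installed and its WordNet data has been downloaded, and the example sentences are invented.

```python
# Illustrative sketch only: scoring one caption with NLTK's METEOR implementation.
# Requires WordNet data, e.g. nltk.download('wordnet').
from nltk.translate.meteor_score import meteor_score

reference = "a man is slicing vegetables in the kitchen".split()
hypothesis = "a man slices vegetables in a kitchen".split()

# meteor_score expects pre-tokenized input: a list of reference token lists
# and one hypothesis token list; it returns a score in [0, 1].
score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.3f}")
```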

Article Computer Science, Artificial Intelligence

Feature-based sequence-to-sequence matching

Yaron Caspi et al.

INTERNATIONAL JOURNAL OF COMPUTER VISION (2006)