4.7 Article

Efficient Video Grounding With Which-Where Reading Comprehension

Related References

Note: only a subset of the references is listed; download the original article for the complete reference information.
Article Engineering, Electrical & Electronic

Action-Centric Relation Transformer Network for Video Question Answering

Jipeng Zhang et al.

Summary: Video question answering (VideoQA) has attracted considerable attention in recent years. Prior work has concentrated on fusion strategies and feature preparation, while little attention has been paid to incorporating actions of interest or exploring frame-to-frame relations. This study introduces an action-centric relation transformer network (ACRTransformer) that addresses these issues and demonstrates superior performance over previous state-of-the-art models.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Article Engineering, Electrical & Electronic

Learning Video Moment Retrieval Without a Single Annotated Video

Junyu Gao et al.

Summary: This paper proposes an approach to video moment retrieval that does not require textual annotations of videos. It leverages existing visual concept detectors and a pre-trained image-sentence embedding space, and combines them with a video-conditioned sentence generator and a GNN-based relation-aware moment localizer to achieve effective video moment retrieval.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Article Engineering, Electrical & Electronic

Human-Centric Spatio-Temporal Video Grounding With Visual Transformers

Zongheng Tang et al.

Summary: In this work, a novel task called Human-centric Spatio-Temporal Video Grounding (HC-STVG) is introduced. HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description, focusing on humans. This task is useful for healthcare and security applications.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)

Article Engineering, Electrical & Electronic

End-to-End Video Question-Answer Generation With Generator-Pretester Network

Hung-Ting Su et al.

Summary: This study introduces the new task of Video Question-Answer Generation (VQAG), which trains video question answering models on question-answer pairs generated from videos. The proposed Generator-Pretester Network verifies the generated questions by attempting to answer them. Experimental results show that the approach achieves state-of-the-art question generation performance on two large-scale human-annotated Video QA datasets and outperforms some supervised baselines on the Video QA task.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2021)

Article Engineering, Electrical & Electronic

Long-Term Video Question Answering via Multimodal Hierarchical Memory Attentive Networks

Ting Yu et al.

Summary: Motivated by how humans identify critical moments and analyze evidence for reasoning, the authors propose multimodal hierarchical memory attentive networks for long-term video question answering. The method outperforms state-of-the-art approaches on public benchmarks, and ablation studies are conducted to examine the effectiveness of each model component.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2021)

Article Engineering, Electrical & Electronic

Fine-Grained Instance-Level Sketch-Based Video Retrieval

Peng Xu et al.

Summary: This research introduces a novel fine-grained instance-level sketch-based video retrieval problem and dataset, and proposes a multi-stream multi-modality deep network with a relation module to improve the matching of visual appearance and motion at a fine-grained level. The results show that this model outperforms existing state-of-the-art models designed for video analysis.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Structured Multi-Level Interaction Network for Video Moment Localization via Language Query

Hao Wang et al.

Summary: In this paper, we address the problem of localizing a specific moment described by a natural language query. We propose a novel Structured Multi-level Interaction Network (SMIN) that disentangles the activity moment into boundary and content, and conducts multi-level cross-modal interaction and content-boundary-moment interaction to tackle this problem. Experimental results show that the proposed approach outperforms the state-of-the-art methods on three benchmarks.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

Daizong Liu et al.

Summary: This paper proposes a novel localization framework for temporal sentence grounding, scoring all pairs of start and end indices within the video simultaneously with a biaffine mechanism. The Context-aware Biaffine Localizing Network (CBLN) incorporates local and global contexts and a multi-modal self-attention module, outperforming state-of-the-art methods on three public datasets.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Article Engineering, Electrical & Electronic

Self-Guided Body Part Alignment With Relation Transformers for Occluded Person Re-Identification

Guanshuo Wang et al.

Summary: This method introduces the Self-guided Body Part Alignment technique, which avoids high dependence on external cues and performs well in occluded and holistic person re-identification tasks.

IEEE SIGNAL PROCESSING LETTERS (2021)

Article Computer Science, Artificial Intelligence

MABAN: Multi-Agent Boundary-Aware Network for Natural Language Moment Retrieval

Xiaoyang Sun et al.

Summary: As the number of videos and surveillance cameras grows, paired sentence descriptions provide significant clues for selecting attentional content from videos. The task of natural language moment retrieval has drawn great interest and requires temporal context comprehension. To address limited moment selection and insufficient comprehension of structural context, a multi-agent boundary-aware network (MABAN) is proposed, which leverages reinforcement learning and cross-modal interaction for enhanced effectiveness.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2021)

Article Computer Science, Artificial Intelligence

Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding

Wenfei Yang et al.

Summary: LCNet utilizes hierarchical representations of video and text features and introduces a self-supervised cycle-consistent loss to effectively learn the matching relationships between video and text, achieving superior performance over existing weakly supervised methods.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2021)

Article Computer Science, Artificial Intelligence

Contour-Aware Loss: Boundary-Aware Learning for Salient Object Segmentation

Zixuan Chen et al.

Summary: The learning model utilizes boundary information for salient object segmentation, with a novel Contour Loss function guiding the perception of object boundaries, enhancing the segmentation effectiveness. Experimental results demonstrate superior performance, with real-time speed achieved on a TITAN X GPU.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2021)

Article Engineering, Electrical & Electronic

Skeleton-Based Action Recognition With Focusing-Diffusion Graph Convolutional Networks

Jialin Gao et al.

Summary: This letter proposes a focusing-diffusion graph convolutional network (FDGCN) to address spatial-temporal context exploration in skeleton-based action recognition. Each skeleton frame is processed through focusing and diffusion processes to achieve context transfer among spatial joints, and a Transformer encoder layer captures the temporal context.

IEEE SIGNAL PROCESSING LETTERS (2021)

Article Computer Science, Information Systems

Video Storytelling: Textual Summaries for Events

Junnan Li et al.

IEEE TRANSACTIONS ON MULTIMEDIA (2020)

Article Computer Science, Information Systems

Convolutional neural network with adaptive inferential framework for skeleton-based action recognition

Hong'en Huang et al.

JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION (2020)

Article Computer Science, Artificial Intelligence

Revisiting Anchor Mechanisms for Temporal Action Localization

Le Yang et al.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2020)

Article Computer Science, Artificial Intelligence

Learning Semantics-Preserving Attention and Contextual Interaction for Group Activity Recognition

Yansong Tang et al.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2019)

Article Computer Science, Artificial Intelligence

Breaking Winner-Takes-All: Iterative-Winners-Out Networks for Weakly Supervised Temporal Action Localization

Runhao Zeng et al.

IEEE TRANSACTIONS ON IMAGE PROCESSING (2019)

Proceedings Paper Computer Science, Artificial Intelligence

General Interaction-Aware Neural Network for Action Recognition

Jialin Gao et al.

PRICAI 2019: TRENDS IN ARTIFICIAL INTELLIGENCE, PT III (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Language-driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model

Weining Wang et al.

2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) (2019)

Proceedings Paper Computer Science, Information Systems

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

Zhu Zhang et al.

PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19) (2019)

Article Computer Science, Artificial Intelligence

DREAM: A Challenge Data Set and Models for Dialogue-Based Reading Comprehension

Kai Sun et al.

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (2019)

Article Engineering, Electrical & Electronic

Large-Scale Video Retrieval Using Image Queries

Andre Araujo et al.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2018)

Article Engineering, Electrical & Electronic

Nonlinear Structural Hashing for Scalable Video Search

Zhixiang Chen et al.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2018)

Proceedings Paper Computer Science, Information Systems

Attentive Moment Retrieval in Videos

Meng Liu et al.

ACM/SIGIR PROCEEDINGS 2018 (2018)

Article Computer Science, Artificial Intelligence

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Dense-Captioning Events in Videos

Ranjay Krishna et al.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Localizing Moments in Video with Natural Language

Lisa Anne Hendricks et al.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Mask R-CNN

Kaiming He et al.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

TALL: Temporal Activity Localization via Language Query

Jiyang Gao et al.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Joao Carreira et al.

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran et al.

2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2015)