Article

Context and Structure Mining Network for Video Object Detection

Journal

INTERNATIONAL JOURNAL OF COMPUTER VISION
Volume 129, Issue 10, Pages 2927-2946

Publisher

SPRINGER
DOI: 10.1007/s11263-021-01507-2

Keywords

Video object detection; Spatial-temporal; Context and structure mining; Cross patch matching

Funding

  1. NSF [CMMI-1646162, CMMI-1954548]

Aggregating temporal features from other frames has proven effective for video object detection. However, traditional methods ignore useful context information around objects and aggregate proposal-level features as an undivided whole. To address these shortcomings, a Context and Structure Mining Network is proposed to better aggregate features for video object detection.
Aggregating temporal features from other frames has been verified to be effective for video object detection, helping to overcome challenges that arise in still images, such as occlusion, motion blur, and rare poses. Currently, proposal-level feature aggregation dominates this direction. However, holistic proposal-level feature aggregation has two main problems. First, the object proposals generated by the region proposal network ignore the useful context information around the object, which has been shown to be helpful for object classification. Second, traditional proposal-level feature aggregation treats the proposal as a whole, without considering the important object structure information; this makes the similarity comparison between two proposals less effective when occlusion or pose misalignment occurs in the proposal objects. To deal with these problems, we propose the Context and Structure Mining Network to better aggregate features for video object detection. In our method, we first encode spatial-temporal context information into object features in a global manner, which benefits object classification. In addition, the holistic proposal is divided into several patches to capture the structure information of the object, and cross patch matching is conducted to alleviate pose misalignment between objects in target and support proposals. Moreover, an importance weight is learned for each target proposal patch to indicate how informative that patch is for the final feature aggregation, so that occluded patches can be neglected. This enables the aggregation module to leverage the most informative patches to obtain the final aggregated feature. The proposed framework outperforms all the latest state-of-the-art methods on the ImageNet VID dataset by a large margin. The code is publicly available at https://github.com/LiangHann/Context-and-Structure-Mining-Network-for-Video-Object-Detection.
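
To make the patch-based aggregation described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of how dividing proposals into patches, cross patch matching, and importance-weighted aggregation could fit together. All module and parameter names here (PatchAggregation, grid, importance, and so on) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of patch-level feature aggregation in the spirit of the
# abstract; NOT the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAggregation(nn.Module):
    """Divide each proposal feature map into patches, match target patches
    against support patches, and aggregate with learned importance weights."""

    def __init__(self, channels: int = 256, grid: int = 2):
        super().__init__()
        self.grid = grid                              # grid x grid patches per proposal
        self.query = nn.Linear(channels, channels)    # projections for similarity
        self.key = nn.Linear(channels, channels)
        self.importance = nn.Linear(channels, 1)      # per-patch informativeness score

    def to_patches(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, C, H, W) RoI-aligned proposal features ->
        # (N, grid*grid, C) by average-pooling each spatial patch.
        pooled = F.adaptive_avg_pool2d(feats, self.grid)   # (N, C, g, g)
        return pooled.flatten(2).transpose(1, 2)           # (N, g*g, C)

    def forward(self, target: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
        # target: (Nt, C, H, W) proposals from the current frame
        # support: (Ns, C, H, W) proposals from neighboring frames
        t = self.to_patches(target)                        # (Nt, P, C)
        s = self.to_patches(support).flatten(0, 1)         # (Ns*P, C)

        q = self.query(t)                                  # (Nt, P, C)
        k = self.key(s)                                    # (Ns*P, C)

        # Cross patch matching: every target patch attends to every support
        # patch, so pose-misaligned parts can still find a correspondence.
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # (Nt, P, Ns*P)
        matched = attn @ s                                 # (Nt, P, C)

        # Learned importance weight per target patch; occluded patches should
        # receive low weight and contribute little to the final feature.
        w = torch.softmax(self.importance(t), dim=1)       # (Nt, P, 1)
        return (w * (t + matched)).sum(dim=1)              # (Nt, C)


if __name__ == "__main__":
    module = PatchAggregation(channels=256, grid=2)
    tgt = torch.randn(4, 256, 7, 7)    # 4 target proposals
    sup = torch.randn(12, 256, 7, 7)   # 12 support proposals
    print(module(tgt, sup).shape)      # torch.Size([4, 256])
```

The softmax over target patches means low-importance (e.g., occluded) patches contribute little to the aggregated feature, which is the behavior the abstract attributes to the learned importance weights.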
