Article

Context and Structure Mining Network for Video Object Detection

Journal

INTERNATIONAL JOURNAL OF COMPUTER VISION
Volume 129, Issue 10, Pages 2927-2946

Publisher

SPRINGER
DOI: 10.1007/s11263-021-01507-2

Keywords

Video object detection; Spatial-temporal; Context and structure mining; Cross patch matching

Funding

  1. NSF [CMMI-1646162, CMMI-1954548]

Aggregating temporal features from other frames has proven effective for video object detection. However, traditional methods ignore useful context information around objects and aggregate proposal-level features as an undivided whole. To address these shortcomings, a Context and Structure Mining Network is proposed to better aggregate features for video object detection.
Aggregating temporal features from other frames has been verified to be effective for video object detection, helping to overcome challenges that arise in still images, such as occlusion, motion blur, and rare poses. Currently, proposal-level feature aggregation dominates this direction. However, holistic proposal-level feature aggregation has two main problems. First, the object proposals generated by the region proposal network ignore the useful context information around the object, which has been shown to be helpful for object classification. Second, traditional proposal-level feature aggregation treats the proposal as a whole, without considering the important object structure information; this makes the similarity comparison between two proposals less effective when occlusion or pose misalignment occurs in the proposal objects. To deal with these problems, we propose the Context and Structure Mining Network to better aggregate features for video object detection. In our method, we first encode spatial-temporal context information into object features in a global manner, which benefits object classification. In addition, the holistic proposal is divided into several patches to capture the structure information of the object, and cross patch matching is conducted to alleviate pose misalignment between objects in target and support proposals. Moreover, an importance weight is learned for each target proposal patch to indicate how informative that patch is for the final feature aggregation, so that occluded patches can be neglected. This enables the aggregation module to leverage the most informative patches to obtain the final aggregated feature. The proposed framework outperforms all the latest state-of-the-art methods on the ImageNet VID dataset by a large margin. The code is publicly available at https://github.com/LiangHann/Context-and-Structure-Mining-Network-for-Video-Object-Detection.
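
To make the patch-based aggregation described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of how dividing proposals into patches, cross patch matching, and importance-weighted aggregation could fit together. All module and parameter names here (PatchAggregation, grid, importance, and so on) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of patch-level feature aggregation in the spirit of the
# abstract; NOT the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAggregation(nn.Module):
    """Divide each proposal feature map into patches, match target patches
    against support patches, and aggregate with learned importance weights."""

    def __init__(self, channels: int = 256, grid: int = 2):
        super().__init__()
        self.grid = grid                              # grid x grid patches per proposal
        self.query = nn.Linear(channels, channels)    # projections for similarity
        self.key = nn.Linear(channels, channels)
        self.importance = nn.Linear(channels, 1)      # per-patch informativeness score

    def to_patches(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, C, H, W) RoI-aligned proposal features ->
        # (N, grid*grid, C) by average-pooling each spatial patch.
        pooled = F.adaptive_avg_pool2d(feats, self.grid)   # (N, C, g, g)
        return pooled.flatten(2).transpose(1, 2)           # (N, g*g, C)

    def forward(self, target: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
        # target: (Nt, C, H, W) proposals from the current frame
        # support: (Ns, C, H, W) proposals from neighboring frames
        t = self.to_patches(target)                        # (Nt, P, C)
        s = self.to_patches(support).flatten(0, 1)         # (Ns*P, C)

        q = self.query(t)                                  # (Nt, P, C)
        k = self.key(s)                                    # (Ns*P, C)

        # Cross patch matching: every target patch attends to every support
        # patch, so pose-misaligned parts can still find a correspondence.
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # (Nt, P, Ns*P)
        matched = attn @ s                                 # (Nt, P, C)

        # Learned importance weight per target patch; occluded patches should
        # receive low weight and contribute little to the final feature.
        w = torch.softmax(self.importance(t), dim=1)       # (Nt, P, 1)
        return (w * (t + matched)).sum(dim=1)              # (Nt, C)


if __name__ == "__main__":
    module = PatchAggregation(channels=256, grid=2)
    tgt = torch.randn(4, 256, 7, 7)    # 4 target proposals
    sup = torch.randn(12, 256, 7, 7)   # 12 support proposals
    print(module(tgt, sup).shape)      # torch.Size([4, 256])
```

The softmax over target patches means low-importance (e.g., occluded) patches contribute little to the aggregated feature, which is the behavior the abstract attributes to the learned importance weights.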
