Article

RelationTrack: Relation-Aware Multiple Object Tracking With Decoupled Representation

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 25, Pages 2686-2697

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TMM.2022.3150169

Keywords

Decoupling representation; deformable attention; multiple object tracking; optimization contradiction; transformer encoder


Abstract
Existing online multiple object tracking (MOT) algorithms often consist of two subtasks, detection and re-identification (ReID). To improve inference speed and reduce complexity, current methods commonly integrate the two subtasks into a unified framework. However, detection and ReID demand different features, which leads to an optimization contradiction during training. To alleviate this contradiction, we devise a module named Global Context Disentangling (GCD) that decouples the learned representation into detection-specific and ReID-specific embeddings, providing an implicit way to balance the different requirements of the two subtasks. Moreover, we observe that preceding MOT methods typically leverage only local information to associate detected targets and neglect the global semantic relations. To resolve this limitation, we develop a module, referred to as Guided Transformer Encoder (GTE), that combines the powerful reasoning ability of the Transformer encoder with deformable attention. Unlike previous works, GTE avoids analyzing all pixels and attends only to the relations between query nodes and a few self-adaptively selected key samples, making it computationally efficient. Extensive experiments on the MOT16, MOT17 and MOT20 benchmarks demonstrate the superiority of the proposed framework, named RelationTrack. The results indicate that RelationTrack surpasses preceding methods significantly and establishes a new state-of-the-art performance, e.g., IDF1 of 70.5% and MOTA of 67.2% on MOT20.
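The two mechanisms described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and all function and variable names below are hypothetical: GCD-style decoupling is shown as two separate projections from one shared feature, and the deformable-attention idea behind GTE is shown as each query attending to only K sampled key locations rather than every pixel of the feature map.

```python
import numpy as np

def global_context_disentangle(shared, W_det, W_reid):
    """GCD-style decoupling (sketch): project one shared backbone feature
    into a detection-specific and a ReID-specific embedding, so each subtask
    is optimized on its own representation.
    shared: (N, C) shared features; W_det, W_reid: (C, D) hypothetical
    projection weights (learned in the real model)."""
    det_embed = shared @ W_det    # detection-specific embedding
    reid_embed = shared @ W_reid  # ReID-specific embedding
    return det_embed, reid_embed

def sparse_relation_attention(feat, ref_pts, offsets, weights):
    """Deformable-attention-style sparse sampling (sketch, nearest-neighbor
    instead of bilinear interpolation): each query attends to only K sampled
    locations instead of all H*W pixels, which is what makes the mechanism
    computationally efficient.
    feat: (H, W, C) feature map; ref_pts: (N, 2) per-query reference points;
    offsets: (N, K, 2) sampling offsets; weights: (N, K) attention weights.
    In the real model, offsets and weights are predicted from the query."""
    H, W, C = feat.shape
    N, K, _ = offsets.shape
    out = np.zeros((N, C))
    for n in range(N):
        for k in range(K):
            # Sample one key location per (query, k) pair, clipped to the map.
            y = int(np.clip(np.rint(ref_pts[n, 0] + offsets[n, k, 0]), 0, H - 1))
            x = int(np.clip(np.rint(ref_pts[n, 1] + offsets[n, k, 1]), 0, W - 1))
            out[n] += weights[n, k] * feat[y, x]
    return out
```

Here each query touches K locations rather than H*W, so the cost per query is O(K·C) instead of O(H·W·C), matching the abstract's claim that attending to a few self-adaptively selected key samples avoids analyzing all pixels.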

