4.7 Article

Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection for Autonomous Driving

Publisher

Institute of Electrical and Electronics Engineers (IEEE)
DOI: 10.1109/TCSVT.2021.3082763

Keywords

Three-dimensional displays; Object detection; Feature extraction; Laser radar; Correlation; Decoding; Head; Lidar-based video; 3D object detection; transformer; temporal-channel attention

Funding

  1. Australian Research Council [DP200103223]
  2. Australian Medical Research Future Fund [MRFAI000085]

Abstract

This paper proposes a new transformer model, the Temporal-Channel Transformer (TCTR), for video object detection from Lidar data by modeling temporal-channel and spatial relationships. The encoder captures temporal-channel information across frames, the decoder recovers spatial information for the target frame, and a gate mechanism refines the target frame's representation. Experimental results show that TCTR achieves state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
Strong industrial demand for autonomous driving has generated intense interest in 3D object detection and produced many excellent 3D object detection algorithms. However, the vast majority of these algorithms model only single-frame data, ignoring the temporal cues in video sequences. In this work, we propose a new transformer, the Temporal-Channel Transformer (TCTR), that models temporal-channel and spatial relationships for video object detection from Lidar data. A distinctive design of this transformer is that the encoder and decoder operate on different kinds of information: the encoder encodes the temporal-channel information of multiple frames, while the decoder decodes spatial information for the current frame in a voxel-wise manner. Specifically, the temporal-channel encoder captures the information of different channels and frames by exploiting the correlations among their features, and the spatial decoder recovers the information at each location of the current frame. Before object detection is performed with the detection head, a gate mechanism re-calibrates the features of the current frame, filtering out object-irrelevant information by repeatedly refining the representation of the target frame along the up-sampling path. Experimental results show that TCTR achieves state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
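To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) attention computed over the temporal-channel axis, treating each channel of each frame as one token, and (b) a sigmoid gate that suppresses object-irrelevant activations in the target frame. This is an illustration under stated assumptions, not the authors' released implementation: the class names, token layout, layer and head counts, and the 1x1-convolution gating function are all illustrative choices.

```python
import torch
import torch.nn as nn

class TemporalChannelEncoder(nn.Module):
    # Treats each channel of each frame as one token, so self-attention
    # runs over the T*C temporal-channel axis rather than over spatial
    # locations. Layer and head counts here are illustrative assumptions.
    def __init__(self, spatial_dim, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=spatial_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        # x: (B, T, C, H, W) voxelized BEV features from T consecutive frames
        b, t, c, h, w = x.shape
        tokens = x.reshape(b, t * c, h * w)   # one token per (frame, channel)
        tokens = self.encoder(tokens)         # attention across frames and channels
        return tokens.reshape(b, t, c, h, w)

class FeatureGate(nn.Module):
    # A sigmoid mask, predicted from decoded features, that suppresses
    # object-irrelevant activations of the target (current) frame; the
    # paper applies such refinement repeatedly along the up-sampling path.
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, target_feat, decoded_feat):
        # both inputs: (B, C, H, W); returns the re-calibrated target feature
        return target_feat * self.mask(decoded_feat)

# Toy usage: 2 frames, 8 channels, a 16x16 BEV grid (so d_model = 256).
feats = torch.randn(1, 2, 8, 16, 16)
enc = TemporalChannelEncoder(spatial_dim=16 * 16)
gate = FeatureGate(channels=8)
encoded = enc(feats)
refined = gate(feats[:, -1], encoded[:, -1])  # gate the current (last) frame
print(refined.shape)  # torch.Size([1, 8, 16, 16])
```

Note the contrast with a standard vision transformer: tokens here index (frame, channel) pairs rather than spatial patches, which is what lets attention exploit correlations across frames and channels before any spatial decoding takes place.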
