Article

DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting

COMPUTATIONAL VISUAL MEDIA (2023)

Journal

COMPUTATIONAL VISUAL MEDIA

Volume 9, Issue 4, Pages 859-873

Publisher

Springer Nature

DOI: 10.1007/s41095-022-0313-5

Keywords

crowd counting; transformer; dilated convolution; global perspective field; pyramid

Categories

Computer Science, Software Engineering

AI summary

Crowd counting is important for public security and urban management. Mainstream methods use convolutional neural networks (CNNs) to regress a density map, which requires detailed per-person annotations. Weakly-supervised methods need only count annotations but often overlook the global perspective field and multi-level information. DTCC is a weakly-supervised method that combines multi-level dilated convolution with a transformer to achieve end-to-end crowd counting. Experimental results on four benchmark datasets show that DTCC outperforms other weakly-supervised methods and is comparable to fully-supervised methods.

Abstract

Crowd counting provides an important foundation for public security and urban management. Because crowd images contain small targets and large density variations, crowd counting is a challenging task. Mainstream methods usually apply convolutional neural networks (CNNs) to regress a density map, which requires annotations of individual persons and counts. Weakly-supervised methods avoid detailed labeling and require only counts as image annotations, but existing methods fail to achieve satisfactory performance because a global perspective field and multi-level information are usually ignored. We propose a weakly-supervised method, DTCC, which effectively combines multi-level dilated convolution and transformer methods to realize end-to-end crowd counting. Its main components are a recursive Swin Transformer and a multi-level dilated convolution regression head. The recursive Swin Transformer combines a pyramid visual transformer with a fine-tuned recursive pyramid structure to capture deep multi-level crowd features, including global features. The multi-level dilated convolution regression head includes multi-level dilated convolution and a linear regression head for the feature extraction module. This module captures both low- and high-level features simultaneously to enlarge the receptive field. In addition, two regression head fusion mechanisms realize dynamic and mean fusion counting. Experiments on four well-known benchmark crowd counting datasets (UCF_CC_50, ShanghaiTech, UCF_QNRF, and JHU-Crowd++) show that DTCC achieves results superior to other weakly-supervised methods and comparable to fully-supervised methods.
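
The abstract names three ideas that a toy example can make concrete: dilated convolution enlarging the receptive field, fusing the counts from several regression branches, and supervising with only the image-level count. The sketch below is illustrative only and is not the DTCC implementation. It works on a 1D list instead of image features, and every name in it (`dilated_conv1d`, `receptive_field`, `mean_fusion`, `dynamic_fusion`, `count_l1_loss`) is hypothetical. In particular, modelling "dynamic fusion" as a softmax-weighted average and using an L1 loss on the count are assumptions, since the abstract does not specify either.

```python
import math


def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D dilated convolution (cross-correlation) on a list of floats."""
    span = (len(kernel) - 1) * dilation  # distance between first and last kernel tap
    return [
        sum(w * signal[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def mean_fusion(counts):
    """Average the counts predicted by the individual regression branches."""
    return sum(counts) / len(counts)


def dynamic_fusion(counts, scores):
    """Softmax-weighted average of branch counts.

    ASSUMPTION: `scores` stand in for learned per-branch weights; the
    abstract does not say how DTCC computes its dynamic fusion.
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by max for numerical stability
    total = sum(exps)
    return sum(c * e / total for c, e in zip(counts, exps))


def count_l1_loss(predicted, target):
    """Weak supervision: compare only the predicted and annotated total count.

    ASSUMPTION: L1 is used here for illustration.
    """
    return abs(predicted - target)


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    kernel = [1.0, 0.0, -1.0]
    for d in (1, 2, 3):
        out = dilated_conv1d(features, kernel, d)
        print(f"dilation={d} receptive_field={receptive_field(len(kernel), d)} output={out}")

    branch_counts = [98.0, 102.0, 106.0]  # made-up per-branch predictions
    fused = mean_fusion(branch_counts)
    print("mean fusion:", fused)
    print("dynamic fusion:", dynamic_fusion(branch_counts, [0.0, 0.0, 5.0]))
    print("count loss:", count_l1_loss(fused, 100.0))
```

With the same 3-tap kernel, raising the dilation from 1 to 3 widens the span of inputs each output sees from 3 to 7 positions without adding weights, which is the property the regression head relies on to mix low- and high-level context.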
