Article

EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing

SENSORS (2022)

Journal

SENSORS

Volume 22, Issue 24, Pages -

Publisher

MDPI

DOI: 10.3390/s22249854

Keywords

deep learning; computer vision; CNN; vision transformer

Categories

Chemistry, Analytical; Engineering, Electrical & Electronic; Instruments & Instrumentation

Funding

National Natural Science Foundation of China (NSFC)
Natural Science Foundation of Zhejiang Province
[61572023]
[62272419]
[LZ22F020010]

Smart Summary

This article introduces EmbedFormer, a convolutional transformer model that uses DwConv as the token mixer to improve performance and reports strong results across several tasks.

Abstract

Visual Transformers (ViTs) have shown impressive performance owing to their strong ability to encode spatial and channel information. MetaFormer provides a general transformer architecture consisting of a token mixer and a channel mixer, which offers a general way to understand how transformers work. It has been shown that this general architecture matters more to a ViT's performance than the self-attention mechanism itself. Consequently, the depth-wise convolution layer (DwConv) is widely accepted as a replacement for local self-attention in transformers. In this work, a pure convolutional transformer is designed. We rethink the difference between the operations of self-attention and DwConv and find that the self-attention layer, which comes with an embedding layer, unavoidably affects channel information, whereas DwConv only mixes token information within each channel. To address this difference, we place an embedding layer before DwConv and use the combination as the token mixer to instantiate a MetaFormer block; the resulting model is named EmbedFormer. In addition, an SEBlock is applied in the channel mixer to improve performance. On the ImageNet-1K classification task, EmbedFormer achieves a top-1 accuracy of 81.7% without additional training images, surpassing the Swin transformer by +0.4% at similar complexity. EmbedFormer is also evaluated on downstream tasks, where its results are consistently above those of PoolFormer, ResNet and DeiT. Compared with PoolFormer-S24, another instance of MetaFormer, EmbedFormer improves the score by +3.0% box AP/+2.3% mask AP on the COCO dataset and +1.3% mIoU on ADE20K.
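The abstract's central observation, that DwConv mixes tokens only within each channel while an embedding layer spreads information across channels, can be illustrated with a small toy example. The sketch below is not the authors' implementation: it assumes a 1-D token sequence instead of a 2-D feature map, uses plain Python lists, and the function names (`depthwise_conv1d`, `pointwise_embed`, `changed_channels`) and all weights are hypothetical, chosen only for the demonstration. It perturbs a single channel of one token and reports which output channels change.

```python
# Toy illustration (not the paper's code): tokens form a 1-D sequence and
# x[t][c] is the value of channel c at token position t.

def depthwise_conv1d(x, kernels):
    """Mix tokens separately in each channel (zero padding, odd kernel size)."""
    num_tokens, num_channels = len(x), len(x[0])
    out = [[0.0] * num_channels for _ in range(num_tokens)]
    for c in range(num_channels):
        kernel = kernels[c]
        half = len(kernel) // 2
        for t in range(num_tokens):
            acc = 0.0
            for j, w in enumerate(kernel):
                src = t + j - half
                if 0 <= src < num_tokens:  # positions outside the sequence count as zero
                    acc += w * x[src][c]
            out[t][c] = acc
    return out


def pointwise_embed(x, weight):
    """Per-token linear map: each output channel reads every input channel."""
    return [
        [sum(w * v for w, v in zip(row, token)) for row in weight]
        for token in x
    ]


def changed_channels(a, b):
    """Return the sorted channel indices at which outputs a and b differ."""
    return sorted(
        {c for ta, tb in zip(a, b) for c, (u, v) in enumerate(zip(ta, tb)) if u != v}
    )


# Hypothetical weights: one 3-tap kernel per channel and a dense 3x3 embedding.
kernels = [[1.0, 2.0, 1.0], [0.5, 1.0, 0.5], [1.0, 0.0, -1.0]]
embed_weight = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [1.0, 0.0, 1.0]]

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 0.0, 1.0]]
x_perturbed = [row[:] for row in x]
x_perturbed[1][0] += 1.0  # change channel 0 of a single token

# DwConv alone: the perturbation stays inside channel 0.
dw_only = changed_channels(
    depthwise_conv1d(x, kernels),
    depthwise_conv1d(x_perturbed, kernels),
)

# Embedding followed by DwConv: the perturbation reaches other channels too.
embedded = changed_channels(
    depthwise_conv1d(pointwise_embed(x, embed_weight), kernels),
    depthwise_conv1d(pointwise_embed(x_perturbed, embed_weight), kernels),
)

print(dw_only)
print(embedded)
```

With these weights, the first list contains only channel 0, while the second contains all three channels, which mirrors the motivation for placing an embedding layer in front of DwConv in the token mixer. The SEBlock in the channel mixer is not modelled here.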
