Article

What Limits the Performance of Local Self-attention?

Journal

INTERNATIONAL JOURNAL OF COMPUTER VISION
Volume 131, Issue 10, Pages 2516-2528

Publisher

SPRINGER
DOI: 10.1007/s11263-023-01813-x

Keywords

Vision transformer; Local self-attention; Representation learning; Image classification

Abstract

Although self-attention is powerful in modeling long-range dependencies, the performance of local self-attention (LSA) is merely comparable to that of depth-wise convolution, which leaves researchers puzzled about whether to use LSA or its counterparts, which one is better, and what limits the performance of LSA. To clarify these questions, we comprehensively investigate LSA and its counterparts in terms of channel setting and spatial processing. We find that the devil lies in attention generation and application, where relative position embedding and neighboring filter application are the key factors. Based on these findings, we propose enhanced local self-attention (ELSA) with Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention over the neighboring area while maintaining a high-order mapping. The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without any architecture or hyperparameter modification, replacing LSA with ELSA as a drop-in boosts Swin Transformer by up to +1.4 in top-1 accuracy. ELSA also consistently benefits VOLO from D1 to D5, where ELSA-VOLO-D5 achieves 87.2% top-1 accuracy on ImageNet-1K without extra training images. In addition, we evaluate ELSA on downstream tasks: it improves the baseline by up to +1.9 box AP / +1.3 mask AP on COCO and by up to +1.9 mIoU on ADE20K.
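The abstract only sketches the two components, so the snippet below is a minimal PyTorch illustration of how a Hadamard-product attention generator and a ghost head could be combined in a local-attention layer. It is an assumption-laden sketch based on the description above, not the authors' released implementation; the module name HadamardLocalAttention, the layer layout, and hyperparameters such as kernel_size and ghost_heads are illustrative choices.

```python
# Illustrative sketch only (not the paper's code): a simplified local-attention
# layer that generates attention with a Hadamard product and expands heads
# with static "ghost" matrices, following the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HadamardLocalAttention(nn.Module):
    def __init__(self, dim, num_heads=2, ghost_heads=4, kernel_size=5):
        super().__init__()
        assert dim % (num_heads * ghost_heads) == 0
        self.num_heads = num_heads        # heads whose attention is generated dynamically
        self.ghost_heads = ghost_heads    # extra heads derived from static matrices
        self.kernel_size = kernel_size    # size of the local neighborhood
        k2 = kernel_size ** 2
        self.qk = nn.Conv2d(dim, 2 * dim, 1)   # query/key projections
        self.v = nn.Conv2d(dim, dim, 1)        # value projection
        # map the Hadamard product q*k to k*k attention logits per dynamic head
        self.attn_proj = nn.Conv2d(dim, num_heads * k2, 1)
        # ghost head: static multiply/add matrices shared across all positions
        self.ghost_mul = nn.Parameter(torch.ones(ghost_heads, k2, 1, 1))
        self.ghost_add = nn.Parameter(torch.zeros(ghost_heads, k2, 1, 1))

    def forward(self, x):
        B, C, H, W = x.shape
        k2 = self.kernel_size ** 2
        heads = self.num_heads * self.ghost_heads

        q, k = self.qk(x).chunk(2, dim=1)
        # Hadamard (element-wise) product replaces the q·k dot product
        attn = self.attn_proj(q * k).view(B, self.num_heads, 1, k2, H, W)
        # ghost heads: modulate each generated map with static matrices
        attn = attn * self.ghost_mul + self.ghost_add         # B, nh, gh, k2, H, W
        attn = attn.reshape(B, heads, k2, H, W).softmax(dim=2)

        # gather the k*k neighborhood of every value pixel and apply the attention
        v = F.unfold(self.v(x), self.kernel_size, padding=self.kernel_size // 2)
        v = v.view(B, heads, C // heads, k2, H * W)
        out = (attn.reshape(B, heads, 1, k2, H * W) * v).sum(dim=3)
        return out.reshape(B, C, H, W)


# Example: drop-in use on a feature map of shape (batch, channels, height, width)
layer = HadamardLocalAttention(dim=64)
y = layer(torch.randn(2, 64, 14, 14))   # y.shape == (2, 64, 14, 14)
```

The point the abstract stresses is reflected here: the element-wise q*k product keeps attention generation cheap within the local window while retaining a high-order (input-dependent, multiplicative) mapping, and the static ghost matrices multiply the effective number of heads, and hence channel capacity, at little extra dynamic cost.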
