Proceedings Paper

KVT: κ-NN Attention for Boosting Vision Transformers

Venue

COMPUTER VISION, ECCV 2022, PT XXIV
Volume 13684, Pages 285-302

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-031-20053-3_17

Keywords

-


Convolutional Neural Networks (CNNs) have long been dominant in computer vision, but recent vision transformer architectures have shown promising performance. This paper proposes κ-NN attention, which enhances vision transformers by selecting only the most similar tokens for the attention map calculation.
Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed, and they show promising performance. A key component in vision transformers is the fully-connected self-attention, which is more powerful than CNNs at modelling long-range dependencies. However, since the current dense self-attention uses all image patches (tokens) to compute the attention matrix, it may neglect the locality of image patches and involve noisy tokens (e.g., cluttered background and occlusion), leading to a slow training process and potential degradation of performance. To address these problems, we propose κ-NN attention for boosting vision transformers. Specifically, instead of involving all the tokens in the attention matrix calculation, we select only the top-κ most similar tokens from the keys for each query to compute the attention map. The proposed κ-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than others. In addition, κ-NN attention allows for the exploration of long-range correlations while filtering out irrelevant tokens, by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that κ-NN attention is effective at speeding up training and distilling noise from input tokens. Extensive experiments are conducted using 11 different vision transformer architectures to verify that the proposed κ-NN attention can work with any existing transformer architecture to improve its prediction performance. The code is available at https://github.com/damo-cv/KVT.
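The core idea of the abstract can be sketched as a small modification of standard scaled dot-product attention: for each query, mask out all but the top-κ key scores before the softmax. The sketch below is a minimal single-head NumPy illustration, not the authors' implementation (the function name `knn_attention` and the masking strategy are this sketch's assumptions; see the KVT repository for the actual code).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knn_attention(Q, K, V, k):
    """Single-head attention keeping only the top-k keys per query.

    Q, K, V: arrays of shape (n_tokens, d). Returns (n_tokens, d).
    Hypothetical sketch of the kappa-NN attention idea, not the KVT code.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) similarity
    # Indices of the k largest scores in each row.
    topk_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    # Additive mask: 0 for kept entries, -inf elsewhere,
    # so non-top-k keys receive zero attention weight.
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, topk_idx, 0.0, axis=-1)
    attn = softmax(scores + mask, axis=-1)              # rows sum to 1
    return attn @ V
```

With k equal to the number of tokens this reduces to ordinary dense attention; with small k each query aggregates only its κ most similar tokens, which is how the locality bias and noise filtering described above arise.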

