Article

Token Selection is a Simple Booster for Vision Transformers

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPAMI.2022.3208922

Keywords

Transformers; Task analysis; Training; Magnetic heads; Head; Aggregates; Computer architecture; Image classification; vision transformer; semantic segmentation; token selection

Abstract

This article explores the token selection behavior of self-attention and proposes simple approaches to enhance its selectivity and diversity. The resulting token selector module significantly boosts the performance of various ViT backbones, allowing ViTs to achieve high accuracy with relatively few parameters, and can be applied to a variety of models for image classification, semantic segmentation, and NLP tasks.
Vision transformers (ViTs) have recently attained state-of-the-art results in visual recognition tasks. Their success is largely attributed to the self-attention component, which models the global dependencies among the image patches (tokens) and aggregates them into higher-level features. However, self-attention brings significant training difficulties to ViTs, and many recent works thus develop new self-attention components to alleviate this issue. In this article, instead of developing complicated self-attention mechanisms, we aim to explore simple approaches to fully release the potential of vanilla self-attention. We first study the token selection behavior of self-attention and find that it suffers from low diversity due to attention over-smoothing, which severely limits its effectiveness in learning discriminative token features. We then develop simple approaches to enhance the selectivity and diversity of self-attention in token selection. The resulting token selector module can serve as a drop-in module for various ViT backbones and consistently boost their performance. Notably, it enables ViTs to achieve 84.6% top-1 classification accuracy on ImageNet with only 25M parameters; when scaled up to 81M parameters, the result further improves to 86.1%. In addition, we present comprehensive experiments demonstrating that the token selector can be applied to a variety of transformer-based models to boost their performance on image classification, semantic segmentation, and NLP tasks. Code is available at https://github.com/zhoudaquan/dvit_repo.
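To make the abstract's framing concrete, the sketch below views the softmax attention weights as a soft token selector and adds a simple diagnostic for the over-smoothing the paper describes (attention rows of different queries collapsing toward the same distribution). This is a minimal illustration, not the authors' token selector; the official implementation is in the linked repository. The temperature knob `tau`, the `attention_diversity` metric, and all module/function names here are this sketch's own assumptions.

```python
# Illustrative sketch: self-attention as soft token selection, plus an
# over-smoothing diagnostic. NOT the paper's token selector module; see
# https://github.com/zhoudaquan/dvit_repo for the official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftTokenSelectionAttention(nn.Module):
    """Multi-head self-attention whose softmax weights act as a soft token
    selector: each output token is a weighted aggregation of all inputs."""

    def __init__(self, dim: int, num_heads: int = 8, tau: float = 1.0):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.tau = tau  # hypothetical knob: tau < 1 sharpens the selection
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, H, N, d)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        attn = (logits / self.tau).softmax(dim=-1)    # soft token selection
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), attn


def attention_diversity(attn: torch.Tensor) -> torch.Tensor:
    """1 minus the mean pairwise cosine similarity between the attention
    rows of different query tokens; values near 0 indicate over-smoothing
    (every token selects roughly the same inputs)."""
    a = F.normalize(attn, p=2, dim=-1)                # (B, H, N, N)
    sim = a @ a.transpose(-2, -1)                     # row-vs-row similarity
    n = sim.shape[-1]
    # Average the off-diagonal entries (the diagonal is ~1 by construction).
    off_diag = (sim.sum(dim=(-2, -1)) - n) / (n * (n - 1))
    return 1.0 - off_diag.mean()


if __name__ == "__main__":
    x = torch.randn(2, 197, 384)                      # ViT-S-like token grid
    for tau in (1.0, 0.5):
        _, attn = SoftTokenSelectionAttention(384, 6, tau)(x)
        print(f"tau={tau}: diversity={attention_diversity(attn).item():.3f}")
```

A lower temperature is only one simple way to sharpen selectivity; the paper's actual module enhances both selectivity and diversity inside the backbone, which this standalone diagnostic does not attempt to reproduce.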
