4.7 Article

What Limits the Performance of Local Self-attention?

Related references

Note: Only a subset of the references is listed.
Article Computer Science, Artificial Intelligence

VOLO: Vision Outlooker for Visual Recognition

Li Yuan et al.

Summary: Vision Transformers (ViTs) are less efficient and encode less rich features than CNNs, owing to their simple image tokenization and redundant attention backbone design. To overcome these limitations, a new architecture called VOLO is proposed, which uses outlook attention to dynamically aggregate local features. VOLO can efficiently encode fine-level features and achieve high-performance visual recognition.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)
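The local aggregation idea behind outlook attention can be sketched as follows. This is a simplified single-head toy version: it generates K*K weights per location rather than the paper's K²×K² per-window attention matrices, and omits the fold/overlap accumulation; `W_attn` and `W_v` are hypothetical projection weights, not the authors' code.

```python
import numpy as np

def outlook_attention(x, W_attn, W_v, K=3):
    """Simplified sketch of VOLO-style outlook attention. At each
    location, weights over the local K*K window are generated directly
    from that location's feature by a linear map (no query-key dot
    products), then used to aggregate the projected values in the
    window. x: (C, H, W); W_attn: (K*K, C); W_v: (D, C)."""
    C, H, W = x.shape
    pad = K // 2
    v = np.einsum('dc,chw->dhw', W_v, x)           # value projection, (D, H, W)
    vp = np.pad(v, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(v)
    for i in range(H):
        for j in range(W):
            logits = W_attn @ x[:, i, j]           # (K*K,) from the center feature
            a = np.exp(logits - logits.max())
            a /= a.sum()                           # softmax over window positions
            patch = vp[:, i:i + K, j:j + K].reshape(v.shape[0], K * K)
            out[:, i, j] = patch @ a               # weighted local aggregation
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))
W_attn = rng.standard_normal((9, 4))
W_v = rng.standard_normal((4, 4))
out = outlook_attention(x, W_attn, W_v)
```

The key contrast with self-attention is that the weights come from a single linear map on the center feature, which is cheap enough to apply densely at fine resolution.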

Proceedings Paper Computer Science, Artificial Intelligence

KVT: κ-NN Attention for Boosting Vision Transformers

Pichao Wang et al.

Summary: Convolutional Neural Networks (CNNs) have long dominated computer vision, but recent vision transformer architectures have shown promising performance. This paper proposes κ-NN attention, which boosts vision transformers by selecting only the most similar tokens when computing the attention map.

COMPUTER VISION, ECCV 2022, PT XXIV (2022)
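The token-selection idea can be sketched in NumPy: each query keeps only its top-k attention logits and masks the rest before the softmax. `knn_attention` below is an illustrative helper under that reading, not the authors' implementation.

```python
import numpy as np

def knn_attention(q, k, v, topk):
    """Sketch of k-NN attention: each query attends only to its
    top-k most similar keys; all other logits are masked to -inf
    so they receive zero weight. q, k, v: (n_tokens, d)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                  # (n, n) similarity scores
    # value of each row's k-th largest logit
    kth = np.partition(logits, -topk, axis=-1)[:, -topk][:, None]
    masked = np.where(logits >= kth, logits, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the kept tokens
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
out = knn_attention(q, k, v, topk=3)
```

With `topk` equal to the number of tokens, this reduces to standard softmax attention; smaller `topk` discards the low-similarity (and presumably noisy) tokens.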

Proceedings Paper Computer Science, Artificial Intelligence

Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

Jiaqi Gu et al.

Summary: In this paper, we propose HRViT, a method to enhance the performance of ViTs on semantic segmentation tasks. By integrating high-resolution multi-branch architectures with ViTs and using various optimization techniques, we improve the performance and efficiency of the model. Experimental results demonstrate that HRViT outperforms existing MiT and CSWin backbones on ADE20K and Cityscapes.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

CMT: Convolutional Neural Networks Meet Vision Transformers

Jianyuan Guo et al.

Summary: This paper introduces CMT, a novel hybrid network based on transformers and CNNs that performs well on image recognition tasks and achieves a better trade-off between accuracy and computational efficiency.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Swin Transformer V2: Scaling Up Capacity and Resolution

Ze Liu et al.

Summary: This paper presents techniques for scaling Swin Transformer up to 3 billion parameters and training it with high-resolution images. Several novel techniques are proposed to address training instability and to effectively transfer models from low to high resolution. Combined with self-supervised pre-training, a strong 3-billion-parameter Swin Transformer model is trained, setting new state-of-the-art records on four representative vision benchmarks.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

A ConvNet for the 2020s

Zhuang Liu et al.

Summary: The development of visual recognition has gone through stages from ConvNets to ViTs and then to hybrid approaches. In this work, the design of a pure ConvNet is reexamined and several key components are discovered, resulting in the construction of the ConvNeXt model series. These models compete with Transformers in terms of accuracy and performance while maintaining the simplicity and efficiency of ConvNets.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Mobile-Former: Bridging MobileNet and Transformer

Yinpeng Chen et al.

Summary: Mobile-Former is a parallel design of MobileNet and transformer with a two-way bridge, combining the advantages of both models for efficient computation and enhanced representation power across image classification and object detection tasks.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2022)

Proceedings Paper Computer Science, Artificial Intelligence

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Xiaoyi Dong et al.

Summary: CSWin Transformer is an efficient and effective Transformer-based backbone for general-purpose vision tasks. It achieves competitive performance by using the Cross-Shaped Window self-attention mechanism, Locally-enhanced Positional Encoding, and a hierarchical structure.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2022)

Article Computer Science, Artificial Intelligence

CARAFE++: Unified Content-Aware ReAssembly of FEatures

Jiaqi Wang et al.

Summary: CARAFE++ is a universal, lightweight, and highly effective operator for feature reassembly in convolutional networks. It aggregates contextual information within a large receptive field, generates adaptive kernels for instance-specific content-aware handling, and introduces little computational overhead. It consistently shows significant improvements in various tasks, making it a strong building block for modern deep networks.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2022)
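The content-aware reassembly step can be sketched as follows, assuming the per-output-position kernels have already been predicted from the feature content (the kernel-prediction module is omitted, and `carafe_reassemble` is an illustrative name, not the authors' API).

```python
import numpy as np

def carafe_reassemble(x, kernels, K=5):
    """Reassembly step of CARAFE-style 2x upsampling, given
    predicted kernels. x: (C, H, W); kernels: (2H, 2W, K, K),
    each assumed softmax-normalized. Every output position is a
    content-aware weighted sum over a K*K source neighborhood,
    instead of the fixed weights of bilinear interpolation."""
    C, H, W = x.shape
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C, 2 * H, 2 * W))
    for i in range(2 * H):
        for j in range(2 * W):
            si, sj = i // 2, j // 2                  # source location in x
            patch = xp[:, si:si + K, sj:sj + K]      # (C, K, K) neighborhood
            out[:, i, j] = (patch * kernels[i, j]).sum(axis=(1, 2))
    return out

# Delta kernels (all weight on the center) reduce this to
# nearest-neighbour upsampling, a handy sanity check.
x = np.arange(18, dtype=float).reshape(2, 3, 3)
kernels = np.zeros((6, 6, 5, 5))
kernels[:, :, 2, 2] = 1.0
up = carafe_reassemble(x, kernels)
```

The same kernel is shared across channels at each position, which is what keeps the operator lightweight despite the large receptive field.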

Proceedings Paper Computer Science, Artificial Intelligence

Decoupled Dynamic Filter Networks

Jingkai Zhou et al.

Summary: The study introduces the Decoupled Dynamic Filter (DDF) to address two main shortcomings of standard convolution. By decomposing the dynamic filter into spatial and channel components, DDF limits the number of parameters and the computational cost, and improves performance when replacing standard convolution in classification networks.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Dynamic Region-Aware Convolution

Jin Chen et al.

Summary: DRConv (Dynamic Region-Aware Convolution) handles spatial information effectively by improving the representational ability of convolution with region-specific filters, while maintaining computational cost and translation invariance.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Involution: Inverting the Inherence of Convolution for Visual Recognition

Duo Li et al.

Summary: The study introduces involution, a new operation that inverts the design principles of standard convolution for vision tasks, improving model performance while reducing computational cost.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)
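The inversion of convolution's design (spatial-specific, channel-agnostic kernels generated from each location's own feature) can be sketched as a single-group toy version; `W_reduce` and `W_span` are hypothetical weights standing in for the paper's kernel-generation bottleneck.

```python
import numpy as np

def involution(x, W_reduce, W_span, K=3):
    """Minimal single-group involution sketch. For each spatial
    location, a K*K kernel is generated from that location's feature
    vector and shared across all channels -- the opposite of standard
    convolution, whose kernels are shared across space but differ per
    channel. x: (C, H, W); W_reduce: (C//r, C); W_span: (K*K, C//r)."""
    C, H, Wd = x.shape
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(Wd):
            feat = x[:, i, j]                       # (C,) feature at this location
            kernel = (W_span @ (W_reduce @ feat)).reshape(K, K)
            patch = xp[:, i:i + K, j:j + K]         # (C, K, K) neighborhood
            # same spatial kernel applied to every channel
            out[:, i, j] = (patch * kernel).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 5, 5))
W_reduce = rng.standard_normal((4, 8)) * 0.1   # reduction ratio r = 2
W_span = rng.standard_normal((9, 4)) * 0.1
out = involution(x, W_reduce, W_span, K=3)
```

Because the kernel is a function of the input feature, the operator is content-adaptive, and sharing it across channels keeps the parameter count well below that of a dynamic per-channel filter.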

Proceedings Paper Computer Science, Artificial Intelligence

CARAFE: Content-Aware ReAssembly of FEatures

Jiaqi Wang et al.

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Mask R-CNN

Kaiming He et al.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)

Proceedings Paper Computer Science, Artificial Intelligence

Xception: Deep Learning with Depthwise Separable Convolutions

Francois Chollet

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) (2017)