Article

PVT v2: Improved baselines with Pyramid Vision Transformer

Journal

COMPUTATIONAL VISUAL MEDIA
Volume 8, Issue 3, Pages 415-424

Publisher

Springer Nature
DOI: 10.1007/s41095-022-0274-8

Keywords

transformers; dense prediction; image classification; object detection; semantic segmentation

Funding

  1. National Natural Science Foundation of China [61672273, 61832008]
  2. Science Foundation for Distinguished Young Scholars of Jiangsu [BK20160021]
  3. Postdoctoral Innovative Talent Support Program of China [BX20200168, 2020M681608]
  4. General Research Fund of Hong Kong [27208720]

Abstract

This work presents the improved Pyramid Vision Transformer v2 (PVT v2), which adds three designs to PVT v1 and achieves significant improvements on fundamental vision tasks. PVT v2 performs comparably to or better than recent work such as the Swin Transformer.
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
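
For intuition, below is a minimal PyTorch sketch of the three designs named in the abstract. It is an illustrative reading of the abstract, not the official code from the repository linked above: the module names (OverlappingPatchEmbed, LinearSRA, ConvFFN), the pooling size, the hidden widths, and the omission of the usual pre-attention normalization are all our own assumptions.

import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    # Overlapping patch embedding: a strided convolution whose kernel is
    # larger than its stride, so neighbouring patches share pixels.
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                         # x: (B, C, H, W)
        x = self.proj(x)                          # (B, D, H/stride, W/stride)
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)          # (B, N, D) token sequence
        return self.norm(x), (H, W)

class LinearSRA(nn.Module):
    # Linear-complexity attention: keys and values are average-pooled to a
    # fixed spatial size, so attention cost grows linearly with N.
    def __init__(self, dim, num_heads=1, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)   # fixed-size k/v map
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):
        B, N, D = x.shape
        H, W = hw
        d = D // self.num_heads
        q = self.q(x).reshape(B, N, self.num_heads, d).transpose(1, 2)
        # Pool the spatial map before computing keys and values.
        kv_in = self.pool(x.transpose(1, 2).reshape(B, D, H, W))
        kv_in = kv_in.flatten(2).transpose(1, 2)      # (B, pool_size**2, D)
        kv = self.kv(kv_in).reshape(B, -1, 2, self.num_heads, d)
        k, v = kv.permute(2, 0, 3, 1, 4)              # each (B, heads, P, d)
        attn = (q @ k.transpose(-2, -1)) * d ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

class ConvFFN(nn.Module):
    # Convolutional feed-forward network: a 3x3 depth-wise convolution
    # between the two linear layers injects local positional information.
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, hw):
        B, N, _ = x.shape
        H, W = hw
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(B, -1, H, W)     # tokens -> feature map
        x = self.dwconv(x).flatten(2).transpose(1, 2)  # feature map -> tokens
        return self.fc2(self.act(x))

# Shape check on a 224x224 image (random weights; residual connections and
# layer norms between sub-layers are omitted here for brevity).
tokens, hw = OverlappingPatchEmbed()(torch.randn(1, 3, 224, 224))
tokens = tokens + LinearSRA(dim=64)(tokens, hw)
tokens = tokens + ConvFFN(dim=64)(tokens, hw)

Pooling the keys and values to a fixed 7x7 map is what makes the attention cost linear in the number of tokens: the query set stays full-resolution, but each query attends to only 49 pooled positions regardless of input size.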
