3.8 Proceedings Paper

Visual Saliency Transformer

Publisher

IEEE
DOI: 10.1109/ICCV48922.2021.00468

Keywords

-

Funding

  1. National Key R&D Program of China [2020AAA0105702]
  2. National Science Foundation of China [62027813, 62036005, U20B2065, U20B2068]

Abstract

This paper proposes VST, a unified transformer-based model for RGB and RGB-D salient object detection. It predicts saliency by modeling long-range dependencies and introduces multi-level token fusion and a new token upsampling method under the transformer framework. A token-based multi-task decoder is also developed to perform saliency and boundary detection simultaneously.
Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models.
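The abstract gives enough detail to sketch the core idea of the token-based multi-task decoder: learnable saliency and boundary tokens are appended to the patch-token sequence, refined with transformer layers, and dense predictions are read out via attention between patch tokens and task tokens. The sketch below is illustrative only, not the authors' released implementation; the class name PatchTaskDecoder, the embedding size, the layer count, and the use of a plain scaled dot-product between patch tokens and task tokens (as a simplified stand-in for the paper's patch-task-attention) are assumptions made here.

```python
import torch
import torch.nn as nn


class PatchTaskDecoder(nn.Module):
    """Illustrative sketch of a token-based multi-task decoder (not the official VST code).

    Learnable saliency and boundary tokens are appended to the patch tokens,
    refined jointly by transformer layers, and each dense map is predicted by
    attention between patch tokens and its task token.
    """

    def __init__(self, dim=384, depth=4, heads=6):
        super().__init__()
        self.saliency_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.boundary_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.patch_proj = nn.Linear(dim, dim)
        self.task_proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, C) tokens coming from the transformer encoder.
        b = patch_tokens.size(0)
        sal = self.saliency_token.expand(b, -1, -1)
        bnd = self.boundary_token.expand(b, -1, -1)
        tokens = torch.cat([patch_tokens, sal, bnd], dim=1)
        tokens = self.blocks(tokens)

        patches, sal, bnd = tokens[:, :-2], tokens[:, -2:-1], tokens[:, -1:]
        q = self.patch_proj(patches)                      # (B, N, C)
        k = self.task_proj(torch.cat([sal, bnd], dim=1))  # (B, 2, C)
        # Simplified patch-task attention: each patch attends to the two task
        # tokens, yielding one saliency logit and one boundary logit per patch.
        logits = torch.einsum("bnc,btc->bnt", q, k) / q.size(-1) ** 0.5
        sal_map, bnd_map = logits[..., 0], logits[..., 1]
        return sal_map.sigmoid(), bnd_map.sigmoid()


if __name__ == "__main__":
    # A hypothetical 14x14 grid of patch tokens from an upstream encoder.
    dummy = torch.randn(2, 196, 384)
    sal, bnd = PatchTaskDecoder()(dummy)
    print(sal.shape, bnd.shape)  # torch.Size([2, 196]) for both maps
```

In the full model these per-patch logits would be reshaped to the patch grid and upsampled to image resolution (the paper proposes its own transformer-based token upsampling for this step, which is omitted here).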

