4.8 Article

A Survey on Vision Transformer

Related References

Note: only a subset of the references is listed here; download the original article for the complete reference information.
Article Computer Science, Artificial Intelligence

A Comprehensive Survey of Scene Graphs: Generation and Application

Xiaojun Chang et al.

Summary: A scene graph is a structured representation of a scene that expresses objects, their attributes, and the relationships between them. As computer vision has advanced, researchers have pursued higher-level understanding and reasoning about visual scenes, and scene graphs have attracted attention as a powerful tool for scene understanding. A minimal data-structure sketch follows this entry.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2023)
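
To make the representation concrete, here is a minimal, illustrative sketch of the data structure a scene graph encodes: objects with attributes, plus (subject, predicate, object) relation triples. The class and field names are ours, not from the survey.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    name: str
    attributes: List[str] = field(default_factory=list)

@dataclass
class SceneGraph:
    objects: List[SceneObject] = field(default_factory=list)
    # relations stored as (subject_index, predicate, object_index)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, name: str, attributes=None) -> int:
        self.objects.append(SceneObject(name, list(attributes or [])))
        return len(self.objects) - 1

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        self.relations.append((subj, predicate, obj))

    def triples(self) -> List[Tuple[str, str, str]]:
        return [(self.objects[s].name, p, self.objects[o].name)
                for s, p, o in self.relations]

# Example: "a brown dog sits on a red couch"
g = SceneGraph()
dog = g.add_object("dog", ["brown"])
couch = g.add_object("couch", ["red"])
g.add_relation(dog, "sitting on", couch)
print(g.triples())  # [('dog', 'sitting on', 'couch')]
```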

Article Engineering, Electrical & Electronic

Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection for Autonomous Driving

Zhenxun Yuan et al.

Summary: This paper proposes a new transformer model, the Temporal-Channel Transformer (TCTR), for video object detection from Lidar data by modeling temporal-channel and spatial relationships. The encoder captures temporal-channel information, the decoder recovers spatial information, and a gate mechanism refines the representation of the target frame. Experimental results show that TCTR achieves state-of-the-art performance among grid voxel-based 3D object detectors on the nuScenes benchmark. An illustrative gating sketch follows this entry.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2022)
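
The gate mechanism mentioned above can be pictured as a learned sigmoid gate that blends target-frame features with temporally aggregated features. The sketch below is a generic gating module of this kind, assuming (B, C, H, W) feature maps; it is not the paper's exact TCTR module.

```python
import torch
import torch.nn as nn

class FeatureGate(nn.Module):
    """Generic sketch: blend target-frame features with temporally
    aggregated features through a learned sigmoid gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, target_feat: torch.Tensor, temporal_feat: torch.Tensor) -> torch.Tensor:
        # target_feat, temporal_feat: (B, C, H, W)
        g = self.gate(torch.cat([target_feat, temporal_feat], dim=1))
        return g * target_feat + (1.0 - g) * temporal_feat

refined = FeatureGate(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(refined.shape)  # torch.Size([2, 64, 32, 32])
```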

Proceedings Paper Computer Science, Artificial Intelligence

Patch Slimming for Efficient Vision Transformers

Yehui Tang et al.

Summary: This paper studies the efficiency of vision transformers and proposes a patch slimming approach that prunes redundant patch tokens to reduce computation. Experimental results demonstrate that the method significantly reduces computational cost without sacrificing performance. A token-pruning sketch follows this entry.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2022)
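
The core operation is dropping uninformative patch tokens. Below is a minimal sketch that keeps only the top-scoring tokens given some per-patch importance score (e.g. derived from attention); the paper itself determines which patches to keep in a top-down, layer-by-layer manner, which this sketch does not reproduce.

```python
import torch

def slim_patches(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the highest-scoring patch tokens.

    tokens: (B, N, D) patch embeddings; scores: (B, N) importance scores.
    Returns (B, K, D) with K = keep_ratio * N.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices                       # (B, K)
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))

tokens = torch.randn(2, 196, 384)
scores = torch.rand(2, 196)
print(slim_patches(tokens, scores).shape)  # torch.Size([2, 98, 384])
```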

Proceedings Paper Computer Science, Artificial Intelligence

CondLaneNet: a Top-to-down Lane Detection Framework Based on Conditional Convolution

Lizhe Liu et al.

Summary: This work proposes a novel top-down lane detection framework, CondLaneNet, which uses conditional convolution to dynamically predict lane instances and their line shapes, achieving real-time efficiency and strong detection accuracy. A conditional-convolution sketch follows this entry.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)
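
Conditional convolution here means generating convolution kernels on the fly from per-instance features and applying them to a shared feature map. The sketch below shows this generic mechanism with a 1x1 kernel per instance; module names and sizes are illustrative, not CondLaneNet's actual heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConvHead(nn.Module):
    """Each detected instance supplies an embedding from which a 1x1 conv
    kernel is generated and applied to a shared feature map, yielding one
    response map per instance."""
    def __init__(self, feat_channels: int, embed_dim: int):
        super().__init__()
        self.feat_channels = feat_channels
        self.kernel_gen = nn.Linear(embed_dim, feat_channels)  # 1x1 kernel, one output channel

    def forward(self, feats: torch.Tensor, instance_embed: torch.Tensor) -> torch.Tensor:
        # feats: (1, C, H, W) shared features; instance_embed: (K, E), one row per instance
        kernels = self.kernel_gen(instance_embed)              # (K, C)
        kernels = kernels.view(-1, self.feat_channels, 1, 1)   # (K, C, 1, 1)
        return F.conv2d(feats, kernels)                        # (1, K, H, W)

head = CondConvHead(feat_channels=64, embed_dim=32)
out = head(torch.randn(1, 64, 40, 100), torch.randn(3, 32))
print(out.shape)  # torch.Size([1, 3, 40, 100])
```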

Proceedings Paper Computer Science, Artificial Intelligence

On the Robustness of Vision Transformers to Adversarial Examples

Kaleel Mahmood et al.

Summary: This study investigates the robustness of vision transformers to adversarial examples, finding that such examples do not transfer readily between CNNs and transformers. The authors introduce a new self-attention blended gradient attack and analyze the security of a simple ensemble defense that combines CNNs and transformers. A generic gradient-blending sketch follows this entry.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)
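
To illustrate only the blending idea, the sketch below takes one FGSM-style step whose gradient is a weighted mix of the losses of several models (e.g. a CNN and a ViT). This is a generic illustration under our own assumptions; the paper's self-attention blended gradient attack is more involved than this.

```python
import torch
import torch.nn.functional as F

def blended_fgsm_step(x, y, models, weights, eps=8 / 255):
    """One FGSM-style step using a weighted blend of gradients from several
    classifiers. x: (B, C, H, W) images in [0, 1]; y: (B,) labels."""
    x = x.clone().detach().requires_grad_(True)
    loss = sum(w * F.cross_entropy(m(x), y) for m, w in zip(models, weights))
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```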

Proceedings Paper Computer Science, Artificial Intelligence

GLiT: Neural Architecture Search for Global and Local Image Transformer

Boyu Chen et al.

Summary: The paper introduces a new Neural Architecture Search (NAS) method to find a better transformer architecture for image recognition. By incorporating a locality module and new search algorithms, the method allows for a trade-off between global and local information, as well as optimizing low-level design choices in each module. Through extensive experiments on the ImageNet dataset, the method demonstrates the ability to find more efficient and discriminative transformer variants compared to existing models like ResNet101 and ViT.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)

Proceedings Paper Computer Science, Artificial Intelligence

BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search

Changlin Li et al.

Summary: The paper introduces an unsupervised NAS method, BossNAS, to address the inaccurate architecture ratings caused by the large weight-sharing space and biased supervision in previous methods. In a new hybrid CNN-transformer search space, the searched model BossNet-T achieves 82.5% accuracy on ImageNet, surpassing EfficientNet by 2.4% with comparable compute. The method also outperforms state-of-the-art NAS methods in architecture rating accuracy on two different search spaces.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)

Proceedings Paper Computer Science, Artificial Intelligence

TokenPose: Learning Keypoint Tokens for Human Pose Estimation

Yanjie Li et al.

Summary: This paper introduces a token-based approach to human pose estimation in which keypoint tokens learn constraint relationships and appearance cues simultaneously, achieving performance comparable to existing methods. A simplified keypoint-token sketch follows this entry.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)
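
The sketch below shows the token idea in simplified form: learnable keypoint tokens are appended to the visual (patch) tokens, a transformer encoder lets them attend to the image and to each other, and each keypoint token is then decoded into a heatmap. Dimensions and module names are our assumptions, not TokenPose's exact configuration.

```python
import torch
import torch.nn as nn

class KeypointTokenHead(nn.Module):
    def __init__(self, num_keypoints=17, dim=192, heatmap_size=(64, 48), depth=4, heads=8):
        super().__init__()
        self.keypoint_tokens = nn.Parameter(torch.randn(1, num_keypoints, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_heatmap = nn.Linear(dim, heatmap_size[0] * heatmap_size[1])
        self.heatmap_size = heatmap_size

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) from a CNN/ViT backbone
        B = patch_tokens.shape[0]
        kp = self.keypoint_tokens.expand(B, -1, -1)
        x = self.encoder(torch.cat([kp, patch_tokens], dim=1))
        kp_out = x[:, : kp.shape[1]]                     # keypoint tokens after attention
        h, w = self.heatmap_size
        return self.to_heatmap(kp_out).view(B, -1, h, w)  # (B, K, H, W) heatmaps

print(KeypointTokenHead()(torch.randn(2, 256, 192)).shape)  # torch.Size([2, 17, 64, 48])
```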

Proceedings Paper Computer Science, Artificial Intelligence

TransPose: Keypoint Localization via Transformer

Sen Yang et al.

Summary: TransPose introduces a Transformer for human pose estimation that efficiently captures long-range relationships and reveals the dependencies among keypoints. Its heatmap-based predictions expose fine-grained, image-specific dependencies, providing evidence of how the model handles special cases such as occlusion.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)

Proceedings Paper Computer Science, Artificial Intelligence

AutoFormer: Searching Transformers for Visual Recognition

Minghao Chen et al.

Summary: AutoFormer is a one-shot architecture search framework dedicated to vision transformer search. By training a supernet from which comparable subnets can be drawn directly, it reaches strong accuracy on ImageNet and outperforms recent models such as ViT and DeiT. A subnet-sampling sketch follows this entry.

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) (2021)
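
One-shot search means subnets share the supernet's weights rather than being trained from scratch. The sketch below, under our own simplifying assumptions, slices one oversized weight matrix down to a randomly sampled width; it only illustrates the shared-weight idea, not AutoFormer's full search procedure.

```python
import random
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    """One big weight matrix is stored; a sampled subnet slices the
    in/out dimensions it needs."""
    def __init__(self, max_in=512, max_out=512):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, out_dim: int) -> torch.Tensor:
        in_dim = x.shape[-1]
        return x @ self.weight[:out_dim, :in_dim].t() + self.bias[:out_dim]

# Each step samples a random subnet configuration from the search space.
search_space = {"embed_dim": [192, 256, 320], "mlp_dim": [384, 512]}
layer = ElasticLinear()
cfg = {k: random.choice(v) for k, v in search_space.items()}
x = torch.randn(4, cfg["embed_dim"])
print(layer(x, cfg["mlp_dim"]).shape)  # e.g. torch.Size([4, 512])
```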

Proceedings Paper Computer Science, Artificial Intelligence

SceneFormer: Indoor Scene Generation with Transformers

Xinpeng Wang et al.

Summary: This study addresses indoor scene generation with transformers, without relying on appearance information. Using self-attention and cross-attention mechanisms, the model generates scenes faster than existing methods with similar or better realism, conditioned on a room layout or a text description.

2021 INTERNATIONAL CONFERENCE ON 3D VISION (3DV 2021) (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Medical Transformer: Gated Axial-Attention for Medical Image Segmentation

Jeya Maria Jose Valanarasu et al.

Summary: Deep convolutional neural networks are widely used for medical image segmentation, but the inherent inductive biases of convolutional architectures limit their ability to model long-range dependencies. Transformer-based architectures use self-attention to encode such dependencies, motivating the gated axial-attention design explored here for medical image segmentation. A simplified gated axial-attention sketch follows this entry.

MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT I (2021)
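
Axial attention factorizes 2D self-attention into one pass along the height axis and one along the width axis. The sketch below adds a learnable scalar gate on each axial output, which is a simplification of the paper's scheme (the paper gates the relative positional terms inside the attention); names and sizes are ours.

```python
import torch
import torch.nn as nn

class GatedAxialAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn_h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate_h = nn.Parameter(torch.zeros(1))
        self.gate_w = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        cols = x.permute(0, 3, 2, 1).reshape(B * W, H, C)   # sequences along the height axis
        cols = cols + self.gate_h * self.attn_h(cols, cols, cols)[0]
        x = cols.reshape(B, W, H, C).permute(0, 3, 2, 1)
        rows = x.permute(0, 2, 3, 1).reshape(B * H, W, C)   # sequences along the width axis
        rows = rows + self.gate_w * self.attn_w(rows, rows, rows)[0]
        return rows.reshape(B, H, W, C).permute(0, 3, 1, 2)

print(GatedAxialAttention()(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```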

Proceedings Paper Computer Science, Artificial Intelligence

Skeletor: Skeletal Transformers for Robust Body-Pose Estimation

Tao Jiang et al.

Summary: The paper introduces Skeletor, a transformer-based network that learns the distribution of 3D pose and motion in an unsupervised manner to reduce inaccuracies and inconsistencies in skeletal estimation. Using strong priors learned from 25 million frames, Skeletor smooths and corrects skeleton sequences, improving 3D human pose estimation.

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Temporal Context Aggregation for Video Retrieval with Contrastive Learning

Jie Shao et al.

Summary: The paper introduces the TCA framework for video representation learning, which incorporates long-range temporal information via self-attention, and proposes a supervised contrastive learning method with a memory-bank mechanism to increase the number of available negative samples. Extensive experiments show significant performance advantages across multiple video retrieval tasks. A contrastive-loss sketch with a memory bank follows this entry.

2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021 (2021)
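
The memory-bank idea is that negatives come from a queue of embeddings accumulated over past batches, so the effective number of negatives is much larger than the batch size. Below is a generic InfoNCE-style sketch of this setup, under our own assumptions about normalization and temperature; it is not TCA's exact loss.

```python
import torch
import torch.nn.functional as F

def infonce_with_memory_bank(query, positive, memory_bank, temperature=0.07):
    """query, positive: (B, D) embeddings of two views/clips of the same video.
    memory_bank: (M, D) embeddings from past batches used as negatives."""
    query = F.normalize(query, dim=1)
    positive = F.normalize(positive, dim=1)
    bank = F.normalize(memory_bank, dim=1)
    pos_logits = (query * positive).sum(dim=1, keepdim=True)   # (B, 1)
    neg_logits = query @ bank.t()                               # (B, M)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(query.shape[0], dtype=torch.long)      # the positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = infonce_with_memory_bank(torch.randn(8, 128), torch.randn(8, 128), torch.randn(4096, 128))
print(loss.item())
```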

Proceedings Paper Computer Science, Artificial Intelligence

End-to-end Lane Shape Prediction with Transformers

Ruijin Liu et al.

Summary: The study introduces an end-to-end lane detection method in which a Transformer network directly outputs the parameters of a lane shape model, improving the learning of global context and of the long, thin structure of lanes. It achieves state-of-the-art accuracy on the TuSimple benchmark and demonstrates strong deployment potential, with the smallest model size and fastest speed among the compared methods.

2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021 (2021)

Proceedings Paper Computer Science, Artificial Intelligence

PolyLaneNet: Lane Estimation via Deep Polynomial Regression

Lucas Tabelini et al.

Summary: Deep learning has strongly influenced the advancement of autonomous driving, yet lane detection remains a challenging problem for safer self-driving vehicles. This study introduces a lane detection method based on deep polynomial regression that competes with existing techniques in efficiency and accuracy, and adds insights on the limitations of current evaluation metrics and on reproducibility. A polynomial-regression sketch follows this entry.

2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) (2021)
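
Polynomial regression for lanes means the network outputs, per lane, the coefficients of a polynomial x(y), and lane points are recovered by evaluating that polynomial at sampled row positions. The sketch below shows this with third-order polynomials; it omits the confidence and vertical-extent outputs of the actual PolyLaneNet, and the layer sizes are our assumptions.

```python
import torch
import torch.nn as nn

class PolyLaneHead(nn.Module):
    """For each of max_lanes lanes, regress coefficients of
    x(y) = a*y^3 + b*y^2 + c*y + d from pooled backbone features."""
    def __init__(self, feat_dim=512, max_lanes=5, order=3):
        super().__init__()
        self.max_lanes, self.order = max_lanes, order
        self.fc = nn.Linear(feat_dim, max_lanes * (order + 1))

    def forward(self, feats: torch.Tensor, ys: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim); ys: (S,) normalized row positions
        coeffs = self.fc(feats).view(-1, self.max_lanes, self.order + 1)    # (B, L, order+1)
        powers = torch.stack([ys ** i for i in range(self.order, -1, -1)])  # (order+1, S)
        return coeffs @ powers                                              # (B, L, S) x per row

xs = PolyLaneHead()(torch.randn(2, 512), torch.linspace(0.4, 1.0, steps=20))
print(xs.shape)  # torch.Size([2, 5, 20])
```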

Article Computer Science, Software Engineering

PCT: Point cloud transformer

Meng-Hao Guo et al.

Summary: This paper introduces the Point Cloud Transformer (PCT), a Transformer-based framework for point cloud learning, enhanced by farthest point sampling and nearest-neighbor search to better capture local context. Extensive experiments demonstrate state-of-the-art performance on shape classification, part segmentation, semantic segmentation, and normal estimation. A sampling-and-grouping sketch follows this entry.

COMPUTATIONAL VISUAL MEDIA (2021)
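
Farthest point sampling picks well-spread center points greedily, and k-nearest-neighbor search gathers each center's local neighborhood; these are the standard operations the summary refers to, sketched below in plain PyTorch (not PCT's full neighbor-embedding module).

```python
import torch

def farthest_point_sample(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Greedy farthest point sampling. xyz: (N, 3); returns n_samples indices."""
    N = xyz.shape[0]
    selected = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    farthest = torch.randint(N, (1,)).item()
    for i in range(n_samples):
        selected[i] = farthest
        dist = torch.minimum(dist, ((xyz - xyz[farthest]) ** 2).sum(dim=1))
        farthest = int(dist.argmax())
    return selected

def knn_group(xyz: torch.Tensor, center_idx: torch.Tensor, k: int) -> torch.Tensor:
    """For each sampled center, return the indices of its k nearest neighbors."""
    d = torch.cdist(xyz[center_idx], xyz)      # (M, N) pairwise distances
    return d.topk(k, largest=False).indices    # (M, k)

pts = torch.rand(1024, 3)
centers = farthest_point_sample(pts, 128)
neighbors = knn_group(pts, centers, k=16)
print(centers.shape, neighbors.shape)  # torch.Size([128]) torch.Size([128, 16])
```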

Article Computer Science, Information Systems

Point Transformer

Nico Engel et al.

Summary: Point Transformer is a deep neural network that operates directly on unordered, unstructured point sets, extracting local and global features and relating them through a local-global attention mechanism. SortNet provides input permutation invariance by selecting points according to a learned score, yielding a sorted, permutation-invariant feature list that can be plugged into common computer vision pipelines; evaluation on standard benchmarks shows competitive results against prior work. A score-based selection sketch follows this entry.

IEEE ACCESS (2021)
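
Score-based selection means each point gets a learned score and only the top-k points (ordered by score) are kept, so the output does not depend on the input ordering. The sketch below illustrates this idea with a small scoring MLP; dimensions and names are illustrative, not the paper's SortNet configuration.

```python
import torch
import torch.nn as nn

class SortNetSketch(nn.Module):
    def __init__(self, in_dim=64, k=32):
        super().__init__()
        self.k = k
        self.score = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, in_dim) per-point features
        s = self.score(feats).squeeze(-1)                 # (B, N) learned scores
        idx = s.topk(self.k, dim=1).indices               # top-k indices, sorted by score
        return feats.gather(1, idx.unsqueeze(-1).expand(-1, -1, feats.shape[-1]))

print(SortNetSketch()(torch.randn(2, 1024, 64)).shape)  # torch.Size([2, 32, 64])
```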

Article Computer Science, Artificial Intelligence

Adversarial Attacks on Deep-learning Models in Natural Language Processing: A Survey

Wei Emma Zhang et al.

ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY (2020)

Review Engineering, Multidisciplinary

Pre-trained models for natural language processing: A survey

Qiu XiPeng et al.

SCIENCE CHINA-TECHNOLOGICAL SCIENCES (2020)

Proceedings Paper Biochemical Research Methods

Attention-Based Transformers for Instance Segmentation of Cells in Microstructures

Tim Prangemeier et al.

2020 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (2020)

Proceedings Paper Computer Science, Artificial Intelligence

SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation

Mohsen Fayyaz et al.

2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) (2020)

Proceedings Paper Computer Science, Artificial Intelligence

Video Multitask Transformer Network

Hongje Seong et al.

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW) (2019)

Proceedings Paper Computer Science, Artificial Intelligence

Temporal Transformer Networks: Joint Learning of Invariant and Discriminative Time Warping

Suhas Lohit et al.

2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) (2019)

Article Computer Science, Artificial Intelligence

Two-Stream Transformer Networks for Video-Based Face Alignment

Hao Liu et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2018)

Article Computer Science, Artificial Intelligence

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren et al.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2017)

Article Computer Science, Information Systems

Top-Down Saliency Detection via Contextual Pooling

Jun Zhu et al.

JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY (2014)

Article Computer Science, Software Engineering

DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning

Tianshi Chen et al.

ACM SIGPLAN NOTICES (2014)

Article Computer Science, Theory & Methods

Spectral Sparsification of Graphs

Daniel A. Spielman et al.

SIAM JOURNAL ON COMPUTING (2011)

Article Multidisciplinary Sciences

The average distances in random graphs with given expected degrees

F Chung et al.

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA (2002)