Article

Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification

Journal

IEEE TRANSACTIONS ON IMAGE PROCESSING
Volume 31, Pages 3095-3110

Publisher

IEEE - Institute of Electrical and Electronics Engineers, Inc.
DOI: 10.1109/TIP.2022.3162964

Keywords

Feature extraction; Transformers; Hyperspectral imaging; Laser radar; Data mining; Collaboration; Data models; Hyperspectral image; light detection and ranging; joint classification; vision transformer; convolutional vision transformer; cross attention fusion

Funding

  1. National Natural Science Foundation of China [42101458, 41801388, 42130112]

Abstract

A novel deep hierarchical vision transformer architecture is developed for the joint classification of hyperspectral and LiDAR data, exploiting the long-range dependency modeling and strong generalization ability of self-attention-based transformer networks to improve the collaborative classification of remote sensing data.
In this study, we develop a novel deep hierarchical vision transformer (DHViT) architecture for the joint classification of hyperspectral and light detection and ranging (LiDAR) data. Current classification methods have limitations in the heterogeneous feature representation and information fusion of multi-modality remote sensing data (e.g., hyperspectral and LiDAR data); these shortcomings restrict collaborative classification accuracy. The proposed deep hierarchical vision transformer architecture exploits both the powerful long-range dependency modeling and the strong cross-domain generalization ability of the transformer network, which is based exclusively on the self-attention mechanism. Specifically, a spectral sequence transformer handles the long-range dependencies along the spectral dimension of hyperspectral images, because all diagnostic spectral bands contribute to land cover classification. Thereafter, a spatial hierarchical transformer structure extracts hierarchical spatial features, which are also crucial for classification, from both hyperspectral and LiDAR data. Furthermore, the cross attention (CA) feature fusion pattern adaptively and dynamically fuses the heterogeneous features from multi-modality data, and this context-aware fusion mode further improves collaborative classification performance. Comparative experiments and ablation studies on three benchmark hyperspectral and LiDAR datasets show that the DHViT model yields average overall classification accuracies of 99.58%, 99.55%, and 96.40%, respectively, which demonstrates the effectiveness and superior performance of the proposed method.
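The abstract's idea of modeling long-range dependencies along the spectral dimension can be illustrated by treating each hyperspectral band as a token in a sequence. The following is a minimal PyTorch sketch of such a spectral sequence transformer; the class name, band count, and all hyperparameters are illustrative assumptions and do not reproduce the authors' DHViT implementation.

```python
# Minimal sketch: each hyperspectral band becomes one token, so a
# transformer encoder can model long-range dependencies across the
# full spectrum. All names and dimensions are illustrative assumptions,
# not the authors' DHViT code.
import torch
import torch.nn as nn

class SpectralSequenceTransformer(nn.Module):
    def __init__(self, num_bands: int = 144, dim: int = 64,
                 num_heads: int = 4, depth: int = 2):
        super().__init__()
        # Embed each scalar band value into a token; a learned
        # positional embedding preserves the band ordering.
        self.embed = nn.Linear(1, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_bands, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        # spectra: (batch, num_bands) reflectance values per pixel
        tokens = self.embed(spectra.unsqueeze(-1)) + self.pos
        return self.encoder(tokens)  # (batch, num_bands, dim)
```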
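Similarly, the cross attention (CA) fusion pattern can be sketched as two attention passes in which each modality queries the other's features. The module below is a minimal sketch assuming PyTorch's nn.MultiheadAttention; the names, shapes, and residual-plus-pooling readout are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of a cross-attention (CA) fusion block for two token
# streams (e.g., HSI tokens and LiDAR tokens). Names, shapes, and the
# use of nn.MultiheadAttention are illustrative assumptions only.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Queries come from one modality, keys/values from the other,
        # so each stream adaptively attends to the other's features.
        self.hsi_to_lidar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lidar_to_hsi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_hsi = nn.LayerNorm(dim)
        self.norm_lidar = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, hsi_tokens: torch.Tensor,
                lidar_tokens: torch.Tensor) -> torch.Tensor:
        # hsi_tokens:   (batch, n_hsi_tokens, dim)
        # lidar_tokens: (batch, n_lidar_tokens, dim)
        hsi_fused, _ = self.hsi_to_lidar(
            query=self.norm_hsi(hsi_tokens),
            key=lidar_tokens, value=lidar_tokens)
        lidar_fused, _ = self.lidar_to_hsi(
            query=self.norm_lidar(lidar_tokens),
            key=hsi_tokens, value=hsi_tokens)
        # Residuals keep each stream's original information; pooled
        # representations are then concatenated for classification.
        hsi_out = (hsi_tokens + hsi_fused).mean(dim=1)
        lidar_out = (lidar_tokens + lidar_fused).mean(dim=1)
        return self.proj(torch.cat([hsi_out, lidar_out], dim=-1))

# Usage sketch:
# fuse = CrossAttentionFusion(dim=256)
# out = fuse(hsi_tokens, lidar_tokens)  # (batch, 256)
```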
