Article

Exploring vision transformer: classifying electron-microscopy pollen images with transformer

Journal

NEURAL COMPUTING & APPLICATIONS
Volume 35, Issue 1, Pages 735-748

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s00521-022-07789-y

Keywords

Image classification; Vision transformer; Self-attention; Knowledge distillation

Pollen identification has broad applications across many fields, and pollen allergy is a common and frequently occurring disease. Accurate and rapid identification of pollen species under the electron microscope can support pollen forecasting and treatment. In this study, a new Vision Transformer pipeline for image classification is proposed; it achieves CNN-equivalent performance on the pollen dataset with fewer model parameters and less training time.
Pollen identification is a sub-discipline of palynology with broad applications in fields such as allergy control, paleoclimate reconstruction, criminal investigation, and petroleum exploration. Among these, pollen allergy is a common and frequently occurring disease worldwide. Accurate and rapid identification of pollen species under the electron microscope helps medical staff with pollen forecasting and with interrupting the natural course of pollen allergy. Current pollen species identification relies on trained researchers manually identifying pollen grains in images, a time-consuming and laborious process that cannot meet the demands of pollen forecasting. Recently, the self-attention-based Transformer has attracted considerable attention in vision tasks such as image classification. However, pure self-attention lacks local operations on pixels and requires pretraining on large-scale datasets to match the performance of convolutional neural networks (CNNs). In this study, we propose a new Vision Transformer pipeline for image classification. First, we design a FeatureMap-to-Token (F2T) module to perform token embedding on the input image: global self-attention operates on tokens that carry local information, and the hierarchical design of CNNs is applied to the Vision Transformer, combining local and global strengths across multiscale spaces. Second, we use a distillation strategy in which the student learns the feature representation in the teacher network's output space, thereby acquiring the CNN's inductive bias and improving recognition accuracy. Experiments demonstrate that, trained from scratch on the electron-microscopy pollen dataset under the same conditions, the proposed model achieves CNN-equivalent performance while requiring fewer model parameters and less training time. Code for the model is available at https://github.com/dkbshuai/PyTorchOur-S.
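
The abstract describes two technical ingredients: a convolutional FeatureMap-to-Token embedding whose tokens feed global self-attention arranged in CNN-style hierarchical stages, and knowledge distillation from a CNN teacher. A minimal PyTorch sketch of the first idea follows; the class names, channel width, kernel size, and stride are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn

class F2TEmbed(nn.Module):
    """FeatureMap-to-Token style embedding (sketch): a small convolutional
    stem extracts a local feature map, which is flattened into a token
    sequence for subsequent global self-attention."""

    def __init__(self, in_ch: int = 3, embed_dim: int = 96):
        super().__init__()
        # Assumed stem: one strided conv; the paper's F2T module may differ.
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(embed_dim),
            nn.GELU(),
        )

    def forward(self, x):
        fmap = self.stem(x)                       # (B, C, H/4, W/4), local features
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, N, C), token sequence
        return tokens

class Stage(nn.Module):
    """One hierarchical stage: global self-attention over tokens that
    already carry local convolutional information."""

    def __init__(self, dim: int = 96, heads: int = 3, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):
        return self.blocks(tokens)

# Example: a 224x224 image yields 56*56 = 3136 tokens of width 96.
x = torch.randn(2, 3, 224, 224)
out = Stage()(F2TEmbed()(x))  # (2, 3136, 96)
```

The distillation step, in which the student matches the teacher CNN's output-space representation to absorb its inductive bias, is commonly realized as a temperature-scaled KL-divergence term blended with the standard cross-entropy loss. The sketch below follows that generic recipe; the temperature and mixing weight are placeholder values, and the authors' exact loss may differ.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    """Blend of cross-entropy on ground-truth labels and a temperature-scaled
    KL term pulling the student toward the CNN teacher's soft outputs.
    T and alpha are placeholder hyperparameters, not values from the paper."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard T^2 scaling keeps gradient magnitudes comparable
    return (1.0 - alpha) * ce + alpha * kd

# Example: teacher logits are treated as fixed targets (no gradient).
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distill_loss(student_logits, teacher_logits, labels)
```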
