Article

Vision Transformer in Industrial Visual Inspection

Journal

APPLIED SCIENCES-BASEL
Volume 12, Issue 23

Publisher

MDPI
DOI: 10.3390/app122311981

Keywords

deep learning; computer vision; vision transformer; attention mechanism; automated industrial visual inspection; defect detection

Funding

  1. German Federal Ministry for Digital and Transport in the program "Future Rail Freight Transport" [53T20011UW]

Abstract

Artificial intelligence has been considered as an approach to visual inspection in industrial applications for decades. Recent successes, driven by advances in deep learning, present a potential paradigm shift and could facilitate automated visual inspection even under complex environmental conditions. For the last ten years, convolutional neural networks (CNNs) have been the de facto standard in deep-learning-based computer vision (CV). Recently, attention-based vision transformer architectures emerged and surpassed the performance of CNNs on benchmark datasets for regular CV tasks such as image classification, object detection, and segmentation. Nevertheless, despite these outstanding results, the application of vision transformers to real-world visual inspection remains sparse. We suspect that this is likely due to the assumption that they require enormous amounts of data to be effective. In this study, we evaluate this assumption. To this end, we perform a systematic comparison of seven widely used state-of-the-art CNN- and transformer-based architectures, trained on three different use cases in the domain of visual damage assessment for railway freight car maintenance. We show that vision transformer models achieve at least equivalent performance to CNNs in industrial applications with sparse data available, and significantly surpass them on increasingly complex tasks.
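The abstract contrasts CNNs with attention-based vision transformers. As background, the following is a minimal, generic sketch of the scaled dot-product self-attention operation at the core of vision transformer architectures. It is not code from this paper: the function name and the toy patch embeddings are illustrative assumptions, and real ViT implementations add learned projections, multiple heads, and positional embeddings on top of this operation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core attention operation used by vision transformers.

    Q, K, V have shape (tokens, d). Each output row is a weighted
    average over the rows of V, with weights derived from query-key
    similarity. In a ViT, the tokens are embedded image patches.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (tokens, tokens) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V, weights

# Toy example: 4 "image patches" embedded in 8 dimensions,
# attending to themselves (self-attention).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out, attn = scaled_dot_product_attention(x, x, x)
```

Unlike a convolution, whose receptive field is local and fixed, every patch here can attend to every other patch in a single layer, which is one reason transformers handle the complex, globally distributed cues of damage assessment well.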

