Article

Micro Expression Recognition Using Convolution Patch in Vision Transformer

Journal

IEEE ACCESS
Volume 11, Issue -, Pages 100495-100507

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2023.3314797

Keywords

Facial expression recognition; deep learning; micro-expression recognition; self-attention; vision transformer

This paper proposes a vision transformer based on convolution patches for the classification of micro-expressions; the model combines convolutional layers with a transformer to capture both local spatial information and global dependencies, leading to improved performance.
Humans possess an intrinsic ability to hide their true emotions. Micro-expressions are subtle, involuntary changes in facial muscles that are easy to miss, which makes their automatic recognition challenging. To address this challenge, several machine and deep learning models have been proposed in recent years. The convolutional neural network (CNN) is a deep learning method that has been widely adopted in vision-related tasks due to its remarkable performance. However, a CNN is prone to overfitting because of its large number of trainable parameters, and it cannot capture global information across an input image. Furthermore, identifying the regions that are important for classifying micro-expressions is a challenging task. The self-attention mechanism addresses these issues by focusing on key areas, and transformers tailored to images, known as vision transformers, have been widely explored in vision-related applications. However, existing vision transformers divide an input image into a fixed number of patches, so the local correlation of image pixels is lost. Moreover, a vision transformer relies on the self-attention mechanism, which effectively captures global dependencies but does not exploit the local spatial relationships in an image. In this work, we propose a vision transformer based on convolution patches to overcome these problems. The proposed algorithm generates $c$ feature maps from an input image using $c$ filters through a convolution operation. These feature maps are then fed to a transformer model as fixed-size image patches to perform classification. Thus, the proposed architecture leverages the advantages of both convolutional layers and transformers, capturing local spatial information and global dependencies, respectively, leading to improved performance. The performance of the proposed model is evaluated on three benchmark datasets, CASME-I, CASME-II, and SAMM, and compared with state-of-the-art machine and deep learning models, achieving classification accuracies of 95.97%, 98.59%, and 100%, respectively.
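The abstract describes the architecture only at a high level. The following minimal PyTorch sketch illustrates the stated idea of feeding convolutional feature maps to a transformer as fixed-size patches; it is not the authors' implementation, and every hyperparameter (the filter count c, the 16x16 patch size, the encoder depth and head count, the 224x224 input size, the five output classes, and the mean-pooled classification head) is an assumption chosen for the example, not a setting from the paper.

# Minimal sketch (assumptions as noted above, not the authors' code):
# a convolutional patch embedding feeding a standard transformer encoder.
import torch
import torch.nn as nn

class ConvPatchViT(nn.Module):
    def __init__(self, in_ch=3, c=64, patch=16, img=224, n_classes=5):
        super().__init__()
        # Convolution with c filters yields c feature maps that preserve
        # local pixel correlations before tokenization.
        self.conv = nn.Conv2d(in_ch, c, kernel_size=3, padding=1)
        # Non-overlapping patch embedding: each patch-sized region of the
        # feature maps becomes one fixed-size token.
        self.to_tokens = nn.Conv2d(c, c, kernel_size=patch, stride=patch)
        n_tokens = (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, c))
        layer = nn.TransformerEncoderLayer(d_model=c, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(c, n_classes)

    def forward(self, x):                                 # x: (B, 3, 224, 224)
        f = self.conv(x)                                  # (B, c, 224, 224)
        t = self.to_tokens(f).flatten(2).transpose(1, 2)  # (B, 196, c) tokens
        z = self.encoder(t + self.pos)                    # global self-attention
        return self.head(z.mean(dim=1))                   # pool and classify

logits = ConvPatchViT()(torch.randn(2, 3, 224, 224))      # -> shape (2, 5)

The design point the abstract emphasizes is the division of labor: the convolutional front end retains local spatial relationships among pixels, while the self-attention layers model global dependencies across the resulting patch tokens.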
