Article

Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory

Journal

APPLIED SCIENCES-BASEL
Volume 9, Issue 8, Article 1599 (2019)

Publisher

MDPI
DOI: 10.3390/app9081599

Keywords

virtual reality (VR); self-attention; automatic lip-reading; sensory input; deep learning

Funding

  1. National Natural Science Foundation of China [61571013]
  2. Beijing Natural Science Foundation of China [4143061]
  3. Science and Technology Development Program of Beijing Municipal Education Commission [KM201710009003]
  4. Great Wall Scholar Reserved Talent Program of North China University of Technology [NCUT2017XN018013]

Abstract

With the improvement of computer performance, virtual reality (VR), as a new mode of visual operation and interaction, gives automatic lip-reading technology based on visual features broad development prospects. In an immersive VR environment, the user's state can be captured through lip movements, allowing the user's thinking to be analyzed in real time. Owing to complex image processing, hard-to-train classifiers, and long recognition times, traditional lip-reading systems struggle to meet the requirements of practical applications. In this paper, a convolutional neural network (CNN) used for image feature extraction is combined with a recurrent neural network (RNN) based on an attention mechanism for automatic lip-reading recognition. Our proposed method can be divided into three steps. First, we extract keyframes from our own independently established database (English pronunciation of the numbers zero to nine by three males and three females). Then, we use the Visual Geometry Group (VGG) network to extract lip image features; the extracted features prove fault-tolerant and effective. Finally, we compare two lip-reading models: (1) a fusion model with an attention mechanism and (2) a fusion model of two networks. The results show that the accuracy of the proposed model is 88.2% on the test dataset, versus 84.9% for the contrastive model. Therefore, our proposed method is superior to traditional lip-reading methods and general neural networks.
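The paper itself does not ship code; as a rough illustration of the architecture the abstract describes, a minimal PyTorch sketch of the VGG + attention-LSTM fusion could look like the following. The hidden size (256), the 64x64 lip-crop resolution, the additive attention pooling, and the `AttentionLipReader` class name are all illustrative assumptions, not the authors' actual settings.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class AttentionLipReader(nn.Module):
    """Illustrative sketch of the described pipeline: per-frame VGG
    features feed an LSTM, a learned attention layer pools the hidden
    states over time, and a 10-way classifier predicts the spoken digit.
    Layer sizes are guesses, not the paper's settings."""

    def __init__(self, num_classes=10, hidden_size=256):
        super().__init__()
        # VGG-16 convolutional backbone as the frame feature extractor.
        self.backbone = vgg16(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)            # -> (B*T, 512, 1, 1)
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        # Additive attention: one scalar score per time step.
        self.attn = nn.Linear(hidden_size, 1)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):                          # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = frames.reshape(b * t, *frames.shape[2:])    # fold time into batch
        feats = self.pool(self.backbone(x)).flatten(1).reshape(b, t, 512)
        hidden, _ = self.lstm(feats)                    # (B, T, hidden_size)
        weights = torch.softmax(self.attn(hidden), dim=1)  # (B, T, 1)
        context = (weights * hidden).sum(dim=1)         # attention-weighted pooling
        return self.classifier(context)                 # (B, num_classes)

model = AttentionLipReader()
clip = torch.randn(2, 16, 3, 64, 64)  # 2 clips of 16 keyframes, 64x64 lip crops
logits = model(clip)
print(logits.shape)                   # torch.Size([2, 10])
```

The attention-weighted sum over LSTM hidden states is what distinguishes the proposed model from the contrastive fusion model, which would instead use only the final hidden state (or a plain average) before classification.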
