Article

MultiD-CNN: A multi-dimensional feature learning approach based on deep convolutional networks for gesture recognition in RGB-D image sequences

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 139, Issue -, Pages -

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2019.112829

Keywords

Gesture recognition; Deep learning; Convolutional neural networks; Multimodal learning; Feature fusion; RGB-D video processing

Funding

  1. Centre National pour la Recherche Scientifique et Technique (CNRST) - Moroccan government [14UIZ2015]
  2. PPR2-2015 project

Abstract

Human gesture recognition has become a pillar of today's intelligent human-computer interfaces, as it typically provides more comfortable and ubiquitous interaction. Such expert systems have promising prospects in various applications, including smart homes, gaming, healthcare, and robotics. However, recognizing human gestures in videos remains one of the most challenging topics in computer vision because of confounding environmental factors such as complex backgrounds, occlusion, and varying lighting conditions. With the recent development of deep learning, many researchers have addressed this problem by building single deep networks to learn spatiotemporal features from video data. However, performance remains unsatisfactory because single deep networks cannot handle these challenges simultaneously; the extracted features therefore fail to capture both the relevant shape information and the detailed spatiotemporal variation of the gestures. One way to overcome these drawbacks is to fuse multiple features from different models learned on multiple vision cues. To this end, we present an effective multi-dimensional feature learning approach, termed MultiD-CNN, for human gesture recognition in RGB-D videos. The key to our design is to learn high-level gesture representations by taking advantage of Convolutional Residual Networks (ResNets) for training extremely deep models and Convolutional Long Short-Term Memory networks (ConvLSTM) for modeling time-series dependencies. More specifically, we first construct an architecture that simultaneously learns spatiotemporal features from RGB and depth sequences through 3D ResNets, whose outputs feed a ConvLSTM that captures the temporal dependencies between them; we show that this combination fuses appearance and motion information effectively. Second, to alleviate distractions from background and other variations, we propose a method that encodes the temporal information into a motion representation, from which a two-stream architecture based on 2D ResNets extracts deep features. Third, we investigate fusion strategies at different levels for blending the classification results, and we show that integrating multiple ways of encoding the spatial and temporal information yields robust and stable spatiotemporal feature learning with better generalization capability. Finally, we evaluate the investigated architectures on four challenging datasets, demonstrating that our approach outperforms prior art in both accuracy and efficiency. The results also affirm the value of embedding the proposed approach in other intelligent-system application areas. (C) 2019 Elsevier Ltd. All rights reserved.
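To make the first component of the abstract concrete, the PyTorch sketch below shows how a small 3D convolutional backbone (standing in for the paper's 3D ResNet) can feed a ConvLSTM that models temporal dependencies, with RGB and depth streams fused at the score level. All module names, layer sizes, and the averaging fusion rule here are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of the MultiD-CNN idea described in the abstract:
# a 3D-CNN backbone feeds a ConvLSTM, and per-modality class scores
# are blended by late fusion. Hyperparameters are placeholders.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell: LSTM gates computed with 2D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class StreamNet(nn.Module):
    """One modality stream: 3D conv backbone -> ConvLSTM over time -> scores."""
    def __init__(self, in_ch, n_classes, feat=32, hid=32):
        super().__init__()
        # Stand-in for the 3D ResNet; the real backbone is much deeper.
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, feat, kernel_size=3, padding=1),
            nn.BatchNorm3d(feat), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2)),
        )
        self.convlstm = ConvLSTMCell(feat, hid)
        self.head = nn.Linear(hid, n_classes)

    def forward(self, clip):                    # clip: (B, C, T, H, W)
        feats = self.backbone(clip)             # (B, feat, T, H', W')
        B, _, T, H, W = feats.shape
        h = feats.new_zeros(B, self.convlstm.hid_ch, H, W)
        c = torch.zeros_like(h)
        for t in range(T):                      # unroll ConvLSTM over time
            h, c = self.convlstm(feats[:, :, t], (h, c))
        pooled = h.mean(dim=(2, 3))             # global average pooling
        return self.head(pooled)

# Late fusion of RGB and depth streams by averaging softmax scores.
rgb_net, depth_net = StreamNet(3, 10), StreamNet(1, 10)
rgb = torch.randn(2, 3, 8, 64, 64)              # batch of 8-frame RGB clips
depth = torch.randn(2, 1, 8, 64, 64)            # aligned depth clips
scores = (rgb_net(rgb).softmax(-1) + depth_net(depth).softmax(-1)) / 2
print(scores.shape)                             # torch.Size([2, 10])

Averaging the two streams' class probabilities, as above, is only the simplest of the late-fusion strategies the abstract says the paper compares; the same skeleton could instead concatenate features before the classifier for an earlier fusion point.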
