Deep Multimodal Fusion: A Hybrid Approach

INTERNATIONAL JOURNAL OF COMPUTER VISION (2018)

Journal

INTERNATIONAL JOURNAL OF COMPUTER VISION

Volume 126, Issue 2-4, Pages 440-456

Publisher

SPRINGER

DOI: 10.1007/s11263-017-0997-7

Keywords

Deep learning; Conditional Restricted Boltzmann Machines; Hybrid; Generative; Discriminative; Multimodal fusion; Gesture recognition; Social interaction modeling

Funding

DARPA [W911NF-12-C-0001]
Air Force Research Laboratory (AFRL)

Abstract

We propose a novel hybrid model that combines the strength of discriminative classifiers with the representational power of generative models. Our focus is on detecting multimodal events in time-varying sequences and on generating missing data in any of the modalities. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. Generative models, on the other hand, learn a rich, informative space that allows for data generation and joint feature representation, which discriminative models lack. We propose a new model that jointly optimizes the representation space using a hybrid energy function. We employ a model based on Restricted Boltzmann Machines (RBMs) to learn a shared representation across multiple modalities with time-varying data. Conditional RBMs (CRBMs) extend the RBM model to take short-term temporal phenomena into account. The hybrid model augments CRBMs with a discriminative component for classification. To this end, we propose a novel Multimodal Discriminative CRBMs (MMDCRBMs) model. First, we train the MMDCRBMs model on labeled data by training each modality and then training a fusion layer. Second, we exploit the generative capability of MMDCRBMs to activate the trained model so that it generates the lower-level data for a specific label that closely matches the actual input data. We evaluate our approach on the ChaLearn dataset (audio-mocap), the Tower Game dataset (mocap-mocap), and three multimodal toy datasets. We report classification accuracy, generation accuracy, and localization accuracy, and show that the model outperforms state-of-the-art methods.
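To make the CRBM idea mentioned in the abstract concrete, here is a minimal sketch of how a conditional RBM can let recent history shift its biases. It follows the commonly used CRBM formulation (binary hidden units, real-valued visible units with unit variance) and is not taken from the paper: the function names, the weight layout (`W`, `A`, `B`), and the toy numbers are all assumptions for illustration, and the discriminative and fusion components of MMDCRBMs are not modeled here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dynamic_bias(static_bias, weights, past):
    # Shift each static bias by a linear function of the past frames.
    # weights[k][j] connects past value k to unit j (hypothetical layout).
    return [
        b + sum(past[k] * weights[k][j] for k in range(len(past)))
        for j, b in enumerate(static_bias)
    ]


def hidden_probs(v, past, W, B, b):
    # P(h_j = 1 | v, past): sigmoid of the history-dependent hidden bias
    # plus the usual RBM input from the current visible frame.
    b_dyn = dynamic_bias(b, B, past)
    return [
        sigmoid(b_dyn[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(b))
    ]


def visible_mean(h, past, W, A, a):
    # Mean of the visible units given hidden state and history
    # (assumes unit-variance Gaussian visibles).
    a_dyn = dynamic_bias(a, A, past)
    return [
        a_dyn[i] + sum(W[i][j] * h[j] for j in range(len(h)))
        for i in range(len(a))
    ]


# Toy example: 2 visible units, 2 hidden units, 2 past values.
W = [[0.5, -0.5], [0.0, 1.0]]   # visible-to-hidden weights
A = [[0.1, 0.0], [0.0, 0.1]]    # past-to-visible weights
B = [[0.0, 0.0], [0.0, 0.0]]    # past-to-hidden weights
a = [0.0, 0.0]                  # static visible biases
b = [0.0, 0.0]                  # static hidden biases
past = [1.0, 2.0]
v = [1.0, 0.0]

p_hidden = hidden_probs(v, past, W, B, b)
v_mean = visible_mean([1.0, 0.0], past, W, A, a)
```

The history enters only through the bias terms, which is what lets this kind of model capture short-term temporal structure while keeping inference over the hidden units as simple as in a plain RBM.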
