4.7 Article

Image-Text Multimodal Emotion Classification via Multi-View Attentional Network

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 23, Pages 4014-4026

Publisher

Institute of Electrical and Electronics Engineers (IEEE)
DOI: 10.1109/TMM.2020.3035277

Keywords

Sentiment analysis; Feature extraction; Task analysis; Analytical models; Visualization; Semantics; Social networking (online); Memory network; multi-view attention mechanism; social media; multimodal emotion analysis

Funding

  1. National Key R&D Program of China [2018YFB1004700]
  2. National Natural Science Foundation of China [61872074, 61772122]

Abstract

This paper introduces a multimodal emotion analysis model, the Multi-view Attentional Network (MVAN), which uses a continually updated memory network to obtain deep semantic image-text features. The model operates in three stages (feature mapping, interactive learning, and feature fusion) and outperforms strong baseline models on multiple datasets.
Compared with single-modal content, multimodal data can express users' feelings and sentiments more vividly and interestingly. Therefore, multimodal sentiment analysis has become a popular research topic. However, most existing methods either learn modal sentiment features independently, without considering their correlations, or they simply integrate multimodal features. In addition, most publicly available multimodal datasets are labeled with sentiment polarities, while the emotions expressed by users are more specific. Based on this observation, in this paper, we build a large-scale image-text emotion dataset (i.e., labeled with distinct emotions), called TumEmo, with more than 190,000 instances collected from Tumblr. We further propose a novel multimodal emotion analysis model based on the Multi-view Attentional Network (MVAN), which utilizes a continually updated memory network to obtain the deep semantic features of image-text pairs. The model includes three stages: feature mapping, interactive learning, and feature fusion. In the feature mapping stage, we leverage image features from an object viewpoint and a scene viewpoint to capture information that is effective for multimodal emotion analysis. Then, an interactive learning mechanism based on the memory network extracts single-modal emotion features and interactively models the cross-view dependencies between the image and the text. In the feature fusion stage, multiple features are deeply fused using a multilayer perceptron and a stacking-pooling module. Experimental results on the MVSA-Single, MVSA-Multiple, and TumEmo datasets show that the proposed MVAN outperforms strong baseline models by large margins.
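The abstract describes the three-stage pipeline only at a high level. The following is a minimal sketch of how such a multi-view, memory-based architecture could be wired up in PyTorch; the layer sizes, the GRU-cell memory update, the number of memory hops, and all names (MVANSketch, attend, etc.) are illustrative assumptions, not the authors' published configuration.

```python
# Hypothetical sketch of the three-stage MVAN pipeline: feature mapping,
# interactive learning via a continually updated memory, and feature fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MVANSketch(nn.Module):
    def __init__(self, text_dim=768, img_dim=2048, hidden=512, hops=3, n_emotions=7):
        super().__init__()
        self.hops = hops
        # Stage 1: feature mapping -- project text features, object-view image
        # features, and scene-view image features into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.obj_proj = nn.Linear(img_dim, hidden)
        self.scene_proj = nn.Linear(img_dim, hidden)
        # Stage 2: interactive learning -- a memory vector repeatedly updated
        # with attention-weighted context from each view.
        self.mem_update = nn.GRUCell(hidden, hidden)
        # Stage 3: feature fusion -- an MLP over the concatenated view summaries.
        self.fusion = nn.Sequential(
            nn.Linear(hidden * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, n_emotions),
        )

    def attend(self, memory, keys):
        # Attention over one view's features, conditioned on the current memory.
        scores = torch.einsum("bh,bnh->bn", memory, keys)
        weights = F.softmax(scores, dim=-1)
        return torch.einsum("bn,bnh->bh", weights, keys)

    def forward(self, text_feats, obj_feats, scene_feats):
        # text_feats:  (B, T, text_dim)  token-level text features
        # obj_feats:   (B, R, img_dim)   object-view region features
        # scene_feats: (B, R, img_dim)   scene-view region features
        t = self.text_proj(text_feats)
        o = self.obj_proj(obj_feats)
        s = self.scene_proj(scene_feats)

        memory = t.mean(dim=1)  # initialize the memory from the text view
        for _ in range(self.hops):
            # Continually update the memory with cross-view context.
            ctx = self.attend(memory, t) + self.attend(memory, o) + self.attend(memory, s)
            memory = self.mem_update(ctx, memory)

        # Fuse the memory-conditioned summaries of all three views.
        fused = torch.cat(
            [self.attend(memory, t), self.attend(memory, o), self.attend(memory, s)],
            dim=-1,
        )
        return self.fusion(fused)  # emotion logits
```

For example, `MVANSketch()(torch.randn(2, 20, 768), torch.randn(2, 36, 2048), torch.randn(2, 36, 2048))` returns a (2, 7) tensor of emotion logits; the actual backbones (e.g., object and scene CNNs, text encoder) and the stacking-pooling fusion module would replace the simple projections and MLP used here.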
