4.7 Article

Image-Text Multimodal Emotion Classification via Multi-View Attentional Network

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 23, Pages 4014-4026

Publisher

Institute of Electrical and Electronics Engineers (IEEE)
DOI: 10.1109/TMM.2020.3035277

Keywords

Sentiment analysis; Feature extraction; Task analysis; Analytical models; Visualization; Semantics; Social networking (online); Memory network; multi-view attention mechanism; social media; multimodal emotion analysis

Funding

  1. National Key R&D Program of China [2018YFB1004700]
  2. National Natural Science Foundation of China [61872074, 61772122]

Abstract

This paper introduces a multimodal emotion analysis model, the Multi-view Attentional Network (MVAN), which uses a continually updated memory network to obtain deep semantic image-text features. The model operates in three stages (feature mapping, interactive learning, and feature fusion) and outperforms strong baseline models on multiple datasets.
Compared with single-modal content, multimodal data can express users' feelings and sentiments more vividly and interestingly. Therefore, multimodal sentiment analysis has become a popular research topic. However, most existing methods either learn modal sentiment features independently, without considering their correlations, or they simply integrate multimodal features. In addition, most publicly available multimodal datasets are labeled with sentiment polarities, while the emotions expressed by users are more specific. Based on this observation, in this paper, we build a large-scale image-text emotion dataset (i.e., labeled with distinct emotions), called TumEmo, with more than 190,000 instances collected from Tumblr. We further propose a novel multimodal emotion analysis model based on the Multi-view Attentional Network (MVAN), which utilizes a continually updated memory network to obtain the deep semantic features of image-text pairs. The model includes three stages: feature mapping, interactive learning, and feature fusion. In the feature mapping stage, we leverage image features from an object viewpoint and a scene viewpoint to capture information that is effective for multimodal emotion analysis. Then, an interactive learning mechanism based on the memory network extracts single-modal emotion features and interactively models the cross-view dependencies between the image and the text. In the feature fusion stage, multiple features are deeply fused using a multilayer perceptron and a stacking-pooling module. Experimental results on the MVSA-Single, MVSA-Multiple, and TumEmo datasets show that the proposed MVAN outperforms strong baseline models by large margins.
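The abstract describes the three-stage pipeline only at a high level. The following is a minimal sketch of how such a multi-view, memory-based architecture could be wired up in PyTorch; the layer sizes, the GRU-cell memory update, the number of memory hops, and all names (MVANSketch, attend, etc.) are illustrative assumptions, not the authors' published configuration.

```python
# Hypothetical sketch of the three-stage MVAN pipeline: feature mapping,
# interactive learning via a continually updated memory, and feature fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MVANSketch(nn.Module):
    def __init__(self, text_dim=768, img_dim=2048, hidden=512, hops=3, n_emotions=7):
        super().__init__()
        self.hops = hops
        # Stage 1: feature mapping -- project text features, object-view image
        # features, and scene-view image features into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.obj_proj = nn.Linear(img_dim, hidden)
        self.scene_proj = nn.Linear(img_dim, hidden)
        # Stage 2: interactive learning -- a memory vector repeatedly updated
        # with attention-weighted context from each view.
        self.mem_update = nn.GRUCell(hidden, hidden)
        # Stage 3: feature fusion -- an MLP over the concatenated view summaries.
        self.fusion = nn.Sequential(
            nn.Linear(hidden * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, n_emotions),
        )

    def attend(self, memory, keys):
        # Attention over one view's features, conditioned on the current memory.
        scores = torch.einsum("bh,bnh->bn", memory, keys)
        weights = F.softmax(scores, dim=-1)
        return torch.einsum("bn,bnh->bh", weights, keys)

    def forward(self, text_feats, obj_feats, scene_feats):
        # text_feats:  (B, T, text_dim)  token-level text features
        # obj_feats:   (B, R, img_dim)   object-view region features
        # scene_feats: (B, R, img_dim)   scene-view region features
        t = self.text_proj(text_feats)
        o = self.obj_proj(obj_feats)
        s = self.scene_proj(scene_feats)

        memory = t.mean(dim=1)  # initialize the memory from the text view
        for _ in range(self.hops):
            # Continually update the memory with cross-view context.
            ctx = self.attend(memory, t) + self.attend(memory, o) + self.attend(memory, s)
            memory = self.mem_update(ctx, memory)

        # Fuse the memory-conditioned summaries of all three views.
        fused = torch.cat(
            [self.attend(memory, t), self.attend(memory, o), self.attend(memory, s)],
            dim=-1,
        )
        return self.fusion(fused)  # emotion logits
```

For example, `MVANSketch()(torch.randn(2, 20, 768), torch.randn(2, 36, 2048), torch.randn(2, 36, 2048))` returns a (2, 7) tensor of emotion logits; the actual backbones (e.g., object and scene CNNs, text encoder) and the stacking-pooling fusion module would replace the simple projections and MLP used here.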
