Journal
IEEE MULTIMEDIA
Volume 29, Issue 2, Pages 94-103
Publisher
IEEE COMPUTER SOC
DOI: 10.1109/MMUL.2022.3161411
Keywords
Acoustics; Feature extraction; Transformers; Emotion recognition; Linguistics; Data mining; Speech recognition
Funding
- National Key R&D Program of China [2018YFB1305200]
- National Natural Science Foundation of China [62176182, 61976216]
- Tianjin Municipal Science and Technology Project [19ZXZNGX00030]
This article proposes an implicitly aligned multimodal transformer fusion framework based on acoustic features and text information for emotion recognition. The model allows two modalities to guide and complement each other, and uses weighted fusion to control the contributions of different modalities, thereby obtaining more complementary emotional representations. Experiments have shown that this method outperforms baseline methods.
People usually express emotions through both paralinguistic and linguistic information in speech, and effectively integrating the two for emotion recognition remains a challenge. A common approach in previous studies is to extract acoustic and lexical representations with bidirectional long short-term memory (BLSTM) networks and then concatenate them. However, such simple per-sentence feature fusion does little to promote interaction between the modalities. In this article, we propose an implicitly aligned multimodal transformer fusion (IA-MMTF) framework based on acoustic features and text information. The model enables the two modalities to guide and complement each other while learning emotional representations; weighted fusion then controls the contribution of each modality, yielding more complementary emotional representations. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database and the multimodal emotionlines dataset (MELD) show that the proposed method outperforms the baseline BLSTM-based method.
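The cross-modal guidance described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' IA-MMTF implementation; the dimensions, the mean pooling, and the fixed fusion weight `alpha` are illustrative assumptions — in the actual model the projections and fusion weights would be learned, and multi-head attention with residual connections would be used.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys_values, d_k):
    """Attend from one modality (queries) to another (keys/values).

    Alignment between the two sequences is learned implicitly through
    the attention weights, so no explicit forced alignment between
    acoustic frames and text tokens is required."""
    scores = queries @ keys_values.T / np.sqrt(d_k)  # (Tq, Tkv)
    attn = softmax(scores, axis=-1)                  # rows sum to 1
    return attn @ keys_values                        # (Tq, d)

# Toy sequences: 6 acoustic frames and 4 text tokens, shared dim 8.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(6, 8))  # e.g. frame-level acoustic features
text = rng.normal(size=(4, 8))      # e.g. token-level text embeddings

# Each modality is guided and complemented by the other.
text_given_audio = cross_modal_attention(text, acoustic, d_k=8)  # (4, 8)
audio_given_text = cross_modal_attention(acoustic, text, d_k=8)  # (6, 8)

# Weighted fusion of pooled representations; alpha controls the
# contribution of each modality (fixed here, learned in practice).
alpha = 0.6
fused = (alpha * text_given_audio.mean(axis=0)
         + (1 - alpha) * audio_given_text.mean(axis=0))
print(fused.shape)  # utterance-level fused emotional representation
```

The key design point the sketch captures is that attention, rather than time alignment, mediates the interaction: each text token softly attends over all acoustic frames (and vice versa), so the two differently sampled sequences can exchange information without frame-to-token alignment.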