Article

Multimodal Spatiotemporal Representation for Automatic Depression Level Detection

Journal

IEEE Transactions on Affective Computing
Volume 14, Issue 1, Pages 294-307

Publisher

IEEE (Institute of Electrical and Electronics Engineers)
DOI: 10.1109/TAFFC.2020.3031345

Keywords

Feature extraction; Depression; Two dimensional displays; Spatiotemporal phenomena; Databases; Three-dimensional displays; Image segmentation; Multimodal depression detection; spatio-temporal attention; audio; video segment-level feature; eigen evolution pooling; video level feature; multimodal attention feature fusion

Abstract

Physiological studies have shown that speech and facial activity differ between depressed and healthy individuals. Based on this fact, we propose a novel spatio-temporal attention (STA) network and a multimodal attention feature fusion (MAFF) strategy to obtain a multimodal representation of depression cues for predicting an individual's depression level. Specifically, we first divide the speech amplitude spectrum/video into fixed-length segments and feed these segments into the STA network, which not only integrates spatial and temporal information through an attention mechanism but also emphasizes the audio/video frames most relevant to depression detection. The audio/video segment-level feature is taken from the output of the last fully connected layer of the STA network. Second, we employ eigen evolution pooling to summarize the changes in each dimension of the audio/video segment-level features, aggregating them into an audio/video-level feature. Third, a multimodal representation carrying complementary information across modalities is generated by the MAFF and fed into a support vector regression (SVR) predictor to estimate depression severity. Experimental results on the AVEC2013 and AVEC2014 depression databases demonstrate the effectiveness of our method.
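The pooling step in the abstract summarizes how each dimension of the segment-level features evolves over time. Below is a minimal NumPy sketch, not the authors' code, assuming the common SVD-based formulation of eigen evolution pooling (project the temporal axis onto the leading left singular vectors of the time-centered feature matrix); the segment count, feature dimension, and k are illustrative, not the paper's settings.

```python
import numpy as np

def eigen_evolution_pooling(segment_feats: np.ndarray, k: int = 1) -> np.ndarray:
    """Pool a (T, D) matrix of segment-level features into one vector.

    Projects the temporal axis onto the k leading left singular vectors of the
    time-centered feature matrix, so each output row captures one dominant
    mode of change per feature dimension. Returns a flattened (k * D,) vector.
    """
    X = segment_feats - segment_feats.mean(axis=0, keepdims=True)  # center over time
    U, _, _ = np.linalg.svd(X, full_matrices=False)  # U[:, i] is the i-th temporal mode
    return (U[:, :k].T @ X).ravel()  # (k, D) -> audio/video-level feature

# e.g., 20 segment-level STA features of dimension 64 -> one 64-dim descriptor
video_level = eigen_evolution_pooling(np.random.default_rng(0).standard_normal((20, 64)))
print(video_level.shape)  # (64,)
```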

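For the fusion and prediction stage, a hedged sketch: the abstract does not specify the MAFF attention form, so `fuse_modalities` below is a hypothetical stand-in (norm-based softmax weighting followed by concatenation), while the SVR predictor matches what the abstract names; the labels and feature dimensions are placeholders.

```python
import numpy as np
from sklearn.svm import SVR

def fuse_modalities(audio_vec: np.ndarray, video_vec: np.ndarray) -> np.ndarray:
    """Toy attention fusion (stand-in for MAFF): softmax-weight each modality
    by a simple relevance score (its L2 norm), then concatenate the weighted
    vectors so complementary information from both modalities is preserved."""
    scores = np.array([np.linalg.norm(audio_vec), np.linalg.norm(video_vec)])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return np.concatenate([w[0] * audio_vec, w[1] * video_vec])

# Fit SVR on fused representations against depression scores
# (AVEC2013/2014 use BDI-II labels in the 0-63 range).
rng = np.random.default_rng(1)
X = np.vstack([fuse_modalities(rng.standard_normal(64), rng.standard_normal(64))
               for _ in range(16)])
y = rng.uniform(0, 63, size=16)  # placeholder labels, not real data
predictor = SVR(kernel="rbf").fit(X, y)
print(predictor.predict(X[:2]))
```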