Article

Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSVT.2022.3225549

Keywords

Multimodal representation learning; attention mechanism; multi-granularity aggregation; video-paragraph retrieval; video captioning

Abstract

In this paper, a Multi-Granularity Aggregation Transformer (MGAT) is proposed for joint video-audio-text representation learning. The method overcomes the limitations of existing methods by designing a multi-granularity transformer module and an attention-guided aggregation module. The aggregated information is aligned with text information at different hierarchical levels using consistency loss and contrastive loss. Experimental results demonstrate the superiority of the proposed method on tasks such as video-paragraph retrieval and video captioning.
Many real-world video-text tasks involve different levels of granularity to represent local and global information with distinct semantics, such as frames and words, clips and sentences, or videos and paragraphs. Most existing multimodal representation learning methods suffer from limitations: (i) adopting expert systems or manual design to extract more fine-grained local information (such as objects and actions in a video frame) for supervision may lead to information asymmetry, since there may be no corresponding information in the other modalities; (ii) neglecting the hierarchical nature of the data when aggregating different levels of information from different modalities causes insufficient representations. To alleviate these issues, we propose a Multi-Granularity Aggregation Transformer (MGAT) for joint video-audio-text representation learning. Specifically, within each modality, we first design a multi-granularity transformer module that relieves information asymmetry by making full use of local and global information from different perspectives. Then, across modalities, we develop an attention-guided aggregation module to fuse audio and video information hierarchically. Finally, we align the aggregated information with text information at different hierarchical levels via intra- and inter-modality consistency losses and a contrastive loss. With the help of information at more granularities, we obtain a well-performing representation model for a variety of tasks, e.g., video-paragraph retrieval and video captioning. Extensive experiments on two challenging benchmarks, ActivityNet Captions and YouCook2, demonstrate the superiority of the proposed method.
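The alignment step described above pairs aggregated audio-visual embeddings with text embeddings via a contrastive objective. The sketch below shows a generic symmetric InfoNCE-style contrastive loss of the kind commonly used for such cross-modal alignment; the function names, temperature value, and NumPy formulation are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative, not the paper's formulation).

    Matched video/text pairs share a row index and are pulled together;
    all other pairs in the batch act as negatives and are pushed apart.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # matched pairs lie on the diagonal

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice such a loss is applied at each hierarchical level (e.g., clip-sentence and video-paragraph), so that alignment is enforced at multiple granularities rather than only at the global level.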

