Article

Multi-Modal fusion with multi-level attention for Visual Dialog

Journal

Information Processing & Management

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2019.102152

Keywords

Visual Dialog; Multi-Modal; Multi-Level; Attention mechanism

Abstract

Given an input image, the Visual Dialog task is to answer a sequence of questions posed in the form of a dialog. Generating accurate answers requires considering all the information in the dialog history, the question, and the image. However, existing methods usually exploit only the high-level semantic information of whole sentences in the dialog history and the question, ignoring the low-level detailed information carried by individual words; likewise, the low-level region information of the image also needs to be considered for question answering. We therefore propose a novel visual dialog method that attends to both high-level and low-level information in the dialog history, the question, and the image. Our approach introduces three low-level attention modules that enhance the representation of words in the dialog history and the question based on word-to-word connections, and enrich the region information of the image based on region-to-region relations. In addition, we design three high-level attention modules that select important words in the dialog history and the question to complement the detailed information for semantic understanding, and select relevant regions in the image to provide targeted visual information for question answering. We evaluate the proposed approach on two datasets, VisDial v0.9 and VisDial v1.0. The experimental results demonstrate that utilizing both low-level and high-level information substantially enhances the representation of the inputs.
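To make the two attention levels in the abstract concrete, below is a minimal PyTorch sketch of the general idea: a low-level module enhances each element of a modality via element-to-element self-attention (word-to-word for text, region-to-region for the image), and a high-level module selects and pools important elements under the guidance of the question. All module names, dimensions, and scoring functions here are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of the two attention levels described in the abstract.
# NOT the paper's implementation; names, dimensions, and scoring
# functions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowLevelAttention(nn.Module):
    """Self-attention over the elements of one modality
    (word-to-word for text, region-to-region for the image)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, n, dim)
        scores = self.query(x) @ self.key(x).transpose(1, 2)
        weights = F.softmax(scores / x.size(-1) ** 0.5, dim=-1)
        return x + weights @ self.value(x)      # residual-enhanced elements

class HighLevelAttention(nn.Module):
    """Question-guided attention that selects important elements
    (words or regions) and pools them into a single vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, x, guide):                # guide: (batch, dim)
        g = guide.unsqueeze(1).expand_as(x)     # broadcast guide to each element
        joint = torch.tanh(self.proj(torch.cat([x, g], dim=-1)))
        weights = F.softmax(self.score(joint), dim=1)
        return (weights * x).sum(dim=1)         # pooled vector: (batch, dim)

# Example: enhance image regions, then pool them under question guidance.
regions = torch.randn(2, 36, 512)               # 36 region features per image
question = torch.randn(2, 512)                  # sentence-level question vector
enhanced = LowLevelAttention(512)(regions)      # region-to-region enhancement
visual = HighLevelAttention(512)(enhanced, question)
print(visual.shape)                             # torch.Size([2, 512])
```

In this sketch the low-level module keeps one vector per element (enriching details), while the high-level module compresses a modality into a single question-aware vector; applying the same pattern to the dialog history, the question, and the image would yield the three low-level and three high-level modules the abstract describes.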
