Proceedings Paper

MUTAN: Multimodal Tucker Fusion for Visual Question Answering

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) (2017)

Journal

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)

Volume -, Issue -, Pages 2631-2639

Publisher

IEEE

DOI: 10.1109/ICCV.2017.285

Categories

Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic

Funding

Labex SMART
French state funds [ANR-11-LABX-65]

Abstract

Bilinear models provide an appealing framework for mixing and merging information in Visual Question Answering (VQA) tasks. They help to learn high-level associations between question meaning and visual concepts in the image, but they suffer from huge dimensionality issues. We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. In addition to the Tucker framework, we design a low-rank matrix-based decomposition to explicitly constrain the interaction rank. With MUTAN, we control the complexity of the merging scheme while keeping interpretable fusion relations. We show how the Tucker decomposition framework generalizes some of the latest VQA architectures, providing state-of-the-art results.
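
The dimensionality issue and the Tucker parametrization described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the toy dimensions, and the variable names (`W_q`, `W_v`, `W_o`, `core`) are assumptions chosen for illustration, and the low-rank constraint on the core tensor is omitted. The sketch only shows that a bilinear map whose weight tensor is Tucker-factored can be evaluated through the small factors, and that this needs far fewer parameters than the full tensor.

```python
import numpy as np


def tucker_fusion(q, v, W_q, W_v, core, W_o):
    """Evaluate a bilinear map on (q, v) using Tucker factors only.

    q: (d_q,) question vector, v: (d_v,) image vector (hypothetical inputs).
    W_q: (d_q, t_q), W_v: (d_v, t_v), core: (t_q, t_v, t_o), W_o: (t_o, n_out).
    """
    q_t = q @ W_q                                  # project question to t_q dims
    v_t = v @ W_v                                  # project image to t_v dims
    z = np.einsum("a,b,abc->c", q_t, v_t, core)    # bilinear interaction in the small space
    return z @ W_o                                 # map to the output space


def full_bilinear_tensor(W_q, W_v, core, W_o):
    """Rebuild the full (d_q, d_v, n_out) tensor implied by the factors."""
    return np.einsum("ia,jb,abc,ck->ijk", W_q, W_v, core, W_o)


def parameter_counts(d_q, d_v, n_out, t_q, t_v, t_o):
    """Return (full tensor size, total size of the Tucker factors)."""
    full = d_q * d_v * n_out
    tucker = d_q * t_q + d_v * t_v + t_q * t_v * t_o + t_o * n_out
    return full, tucker


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_q, d_v, n_out = 6, 5, 4      # toy sizes, far smaller than real VQA models
    t_q, t_v, t_o = 3, 2, 2        # assumed sizes of the core tensor

    W_q = rng.normal(size=(d_q, t_q))
    W_v = rng.normal(size=(d_v, t_v))
    core = rng.normal(size=(t_q, t_v, t_o))
    W_o = rng.normal(size=(t_o, n_out))
    q = rng.normal(size=d_q)
    v = rng.normal(size=d_v)

    # The factored evaluation matches contracting q and v with the full tensor.
    y_factored = tucker_fusion(q, v, W_q, W_v, core, W_o)
    y_full = np.einsum("i,j,ijk->k", q, v, full_bilinear_tensor(W_q, W_v, core, W_o))
    print(np.allclose(y_factored, y_full))
    print(parameter_counts(d_q, d_v, n_out, t_q, t_v, t_o))
```

In this sketch the sizes `t_q`, `t_v`, and `t_o` play the role of the complexity control mentioned in the abstract: shrinking them reduces the parameter count while the fusion remains a bilinear function of the two inputs.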
