Article

Predicting Visual Features From Text for Image and Video Caption Retrieval

Journal

IEEE TRANSACTIONS ON MULTIMEDIA
Volume 20, Issue 12, Pages 3377-3388

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TMM.2018.2832602

Keywords

Image and video caption retrieval

Funding

  1. National Natural Science Foundation of China [61672523]
  2. Fundamental Research Funds for the Central Universities
  3. Research Funds of Renmin University of China [18XNLG19]
  4. STW STORY project

Abstract

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multiscale sentence vectorization and further transferred into a deep visual feature of choice via a simple multilayer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both three-dimensional convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset, and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, the potential for multimodal query composition, and its state-of-the-art results.
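The core idea in the abstract, predicting a deep visual feature directly from a caption with a multilayer perceptron and then ranking candidate captions in the visual space, can be illustrated with a minimal sketch. The sketch below assumes PyTorch; the dimensions, the mean-pooled word-embedding sentence vector (standing in for the paper's multiscale sentence vectorization), and the mean-squared-error training step are illustrative assumptions, not the authors' exact implementation.

    # Minimal sketch: a text encoder that predicts a visual feature vector,
    # so caption retrieval can be done in the visual space. Dimensions and
    # the mean-pooled sentence vector are assumptions for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Text2VisualNet(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=500,
                     hidden_dim=1000, visual_dim=2048):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            # Simple multilayer perceptron mapping the sentence vector into
            # the target visual space (e.g., a 2048-d ResNet pool feature).
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, visual_dim),
            )

        def forward(self, token_ids):
            # token_ids: (batch, seq_len), padded with 0
            emb = self.embed(token_ids)                    # (batch, seq_len, embed_dim)
            mask = (token_ids != 0).unsqueeze(-1).float()  # ignore padding
            sent_vec = (emb * mask).sum(1) / mask.sum(1).clamp(min=1.0)
            return self.mlp(sent_vec)                      # predicted visual feature

    def retrieval_score(predicted_visual, image_visual):
        # Rank candidate captions for an image by the cosine similarity between
        # the feature predicted from each caption and the image's own feature.
        return F.cosine_similarity(predicted_visual, image_visual, dim=-1)

    # Illustrative training step: regress the predicted features onto the
    # ground-truth visual features of the paired images.
    model = Text2VisualNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    captions = torch.randint(1, 10000, (8, 12))   # dummy token ids
    image_feats = torch.randn(8, 2048)            # dummy image features
    loss = F.mse_loss(model(captions), image_feats)
    loss.backward()
    optimizer.step()

At retrieval time, the sentence that maximizes retrieval_score against the query image's feature is returned, which matches the abstract's point that matching happens in the visual space rather than in a joint subspace.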
