Journal
PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17)
Volume: -, Issue: -, Pages: 1698-1706
Publisher
ASSOC COMPUTING MACHINERY
DOI: 10.1145/3123266.3123369
Keywords
Cross-media retrieval; rich semantic embeddings; multi-sensory fusion; TextNet
Funding
- Shenzhen Peacock Plan [20130408183003656]
- Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality [ZDSYS201703031405467]
- Guangdong Science and Technology Project [2014B010117007]
Abstract
Cross-media retrieval aims to discover semantic associations between different media types. Most existing methods have focused on learning mapping functions or finding optimal common spaces, but have neglected how people actually perceive images and texts. This paper proposes a brain-inspired cross-media retrieval framework that learns rich semantic embeddings of multimedia. Rather than directly using off-the-shelf image features, we combine the visual and descriptive senses of an image, following the view of human perception, via a joint model called the multi-sensory fusion network (MSFN). A topic-model-based TextNet maps texts into the same semantic space as images according to their shared ground-truth labels. Moreover, to overcome the limitations of insufficient training data for neural networks and the limited complexity of existing text corpora, we introduce a large-scale image-text dataset, the Britannica dataset. Extensive experiments demonstrate the effectiveness of our framework on texts of different lengths across three benchmark datasets as well as the Britannica dataset. Above all, we report the best known average results for Img2Text and Text2Img compared with several state-of-the-art methods.
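As a minimal sketch of the retrieval step the abstract describes (not the paper's MSFN or TextNet, whose architectures are not given here): once images and texts are embedded in a shared semantic space, both Img2Text and Text2Img reduce to nearest-neighbour ranking by cosine similarity. All function names and dimensions below are illustrative assumptions.

```python
# Hypothetical sketch: cross-media retrieval by cosine ranking in a
# shared embedding space. The embedding networks themselves are assumed
# to exist; here we just rank pre-computed embeddings.
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_media_rank(query_emb, gallery_emb):
    """Return gallery indices sorted from most to least similar to the query."""
    q = l2_normalize(query_emb.reshape(1, -1), axis=1)
    g = l2_normalize(gallery_emb, axis=1)
    sims = (g @ q.T).ravel()   # cosine similarity of query vs. each gallery item
    return np.argsort(-sims)   # indices in descending similarity order

# Toy example: one text-query embedding against three image embeddings.
rng = np.random.default_rng(0)
text_query = rng.normal(size=128)
images = rng.normal(size=(3, 128))
images[1] = text_query + 0.01 * rng.normal(size=128)  # semantically close image
ranking = cross_media_rank(text_query, images)
print(ranking[0])  # index 1 ranks first, since it nearly matches the query
```

The same ranking function serves both directions: swapping which modality supplies the query and which supplies the gallery switches between Text2Img and Img2Text.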