Journal
Publisher
Association for Computing Machinery
DOI: 10.1145/3511808.3557710
Keywords
cross-modal search; retrieval; computer vision
Texture BERT describes the visual attributes of texture in natural language. It captures fine detail in texture images with group-wise compact bilinear pooling and improves image-text matching with self-attention transformer layers.
We propose Texture BERT, a model that describes the visual attributes of texture using natural language. To capture the rich details in texture images, we propose a group-wise compact bilinear pooling method, which represents a texture image as a set of visual patterns. The similarity between a texture image and its language description is determined by cross-matching the set of visual patterns from the image against the set of word features from the description. We also use self-attention transformer layers to provide cross-modal context and make the matching more effective. Our approach achieves state-of-the-art accuracy on both text retrieval and image retrieval tasks, demonstrating the effectiveness of Texture BERT in describing texture through natural language.
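As a rough illustration of the set-to-set matching described in the abstract, the following NumPy sketch builds a set of visual patterns from a feature map and scores it against a set of word features. It is not the authors' implementation, and several details are assumptions, since the abstract does not specify them: "group-wise" is read as splitting the channels into equal groups, compact bilinear pooling is approximated with the Tensor Sketch construction (two count sketches combined by FFT-based circular convolution), and cross-matching is taken to be the best cosine match per word, averaged over words. The function names (`sketch_matrix`, `compact_bilinear`, `groupwise_patterns`, `cross_match`) are hypothetical, word features are assumed to be already projected to the pattern dimension, and the self-attention transformer layers are omitted.

```python
import numpy as np


def sketch_matrix(c, d, rng):
    """Random count-sketch projection (c -> d): one signed entry per input channel."""
    m = np.zeros((c, d))
    m[np.arange(c), rng.integers(0, d, size=c)] = rng.choice([-1.0, 1.0], size=c)
    return m


def compact_bilinear(x, m1, m2):
    """Approximate sum-pooled bilinear features of x, shape (n_locations, c).

    Tensor Sketch: the circular convolution of two count sketches (computed
    with FFTs) is a sketch of the outer product of x_i with itself.
    """
    d = m1.shape[1]
    f1 = np.fft.rfft(x @ m1, axis=1)
    f2 = np.fft.rfft(x @ m2, axis=1)
    per_location = np.fft.irfft(f1 * f2, n=d, axis=1)
    return per_location.sum(axis=0)  # sum-pool over spatial locations


def groupwise_patterns(feat, groups, d, seed=0):
    """Assumed grouping: split channels into equal groups, one pattern per group."""
    rng = np.random.default_rng(seed)
    patterns = []
    for chunk in np.split(feat, groups, axis=1):  # needs c divisible by groups
        c = chunk.shape[1]
        m1, m2 = sketch_matrix(c, d, rng), sketch_matrix(c, d, rng)
        patterns.append(compact_bilinear(chunk, m1, m2))
    patterns = np.stack(patterns)  # (groups, d)
    return patterns / (np.linalg.norm(patterns, axis=1, keepdims=True) + 1e-12)


def cross_match(patterns, words):
    """Assumed aggregation: best-matching pattern per word, averaged over words."""
    patterns = patterns / (np.linalg.norm(patterns, axis=1, keepdims=True) + 1e-12)
    words = words / (np.linalg.norm(words, axis=1, keepdims=True) + 1e-12)
    sim = patterns @ words.T  # cosine similarities, shape (groups, n_words)
    return float(sim.max(axis=0).mean())
```

For example, a 7x7 feature map with 64 channels flattened to shape `(49, 64)` and passed to `groupwise_patterns(feat, groups=4, d=32)` yields four unit-norm 32-dimensional patterns, which `cross_match` can then compare with a `(n_words, 32)` array of word features. Other aggregation rules, such as averaging over patterns instead of words, would fit the abstract's description equally well.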