Proceedings Paper

FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/CVPR52688.2022.01371

Keywords

-


Abstract

Fashion image retrieval based on a query pair of a reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from the visual and textual modalities simultaneously. We propose a new vision-language transformer based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.
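To make the asymmetric design more concrete, the sketch below shows one plausible way to fuse multi-level target image features with attention alone, keeping text and transformer layers out of the target branch. It is a minimal illustration under stated assumptions, not the authors' code: the class name AttentionFeatureFusion, the 512-dimensional embedding, the single learnable query token, and the three feature levels are all hypothetical.

```python
import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    """Hypothetical sketch of attention-based fusion of multi-level
    image features: a single learnable query attends over feature tokens
    from several context levels and pools them into one target embedding,
    with no text input and no transformer encoder involved."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learnable pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, level_feats):
        # level_feats: list of (batch, n_tokens_i, dim) tensors, one per
        # context level (e.g. whole image, grid regions, detected RoIs).
        tokens = torch.cat(level_feats, dim=1)         # (B, N, dim)
        q = self.query.expand(tokens.size(0), -1, -1)  # (B, 1, dim)
        fused, _ = self.attn(q, tokens, tokens)        # attend over all levels
        return self.proj(fused.squeeze(1))             # (B, dim) target embedding

# Usage with three assumed feature levels for a batch of 4 target images.
fusion = AttentionFeatureFusion(dim=512)
feats = [torch.randn(4, n, 512) for n in (1, 49, 8)]   # global, grid, RoI tokens
print(fusion(feats).shape)                             # torch.Size([4, 512])
```

Because the target side skips the transformer entirely, target embeddings can be precomputed and indexed, which is one plausible motivation for the asymmetry described in the abstract.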

Authors

-

Reviews

Overall rating

3.8
Insufficient ratings

Secondary ratings

Novelty
-
Significance
-
Scientific rigor
-

Recommendations

No data available