Article

Masked Vision-language Transformer in Fashion

MACHINE INTELLIGENCE RESEARCH (2023)

Journal

MACHINE INTELLIGENCE RESEARCH

Volume 20, Issue 3, Pages 421-434

Publisher

SPRINGERNATURE

DOI: 10.1007/s11633-022-1394-4

Keywords

Vision-language; masked image reconstruction; transformer; fashion; e-commercial

Categories

Automation & Control Systems; Computer Science, Artificial Intelligence

AI Summary

This paper presents a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation, which replaces the bidirectional encoder representations from Transformers (BERT) with the vision transformer architecture. It is the first end-to-end framework for the fashion domain and includes masked image reconstruction (MIR) for fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that can handle raw multi-modal inputs without extra pre-processing models and shows improvements in retrieval and recognition tasks compared to Kaleido-BERT, the Fashion-Gen 2018 winner.

Abstract

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with the vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
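The rank@5 figure in the abstract is a retrieval metric. As a rough illustration only (this is not taken from the MVLT code base, and the paper's exact evaluation protocol may differ), rank@K is commonly computed as the fraction of queries whose ground-truth match appears among the K highest-scoring candidates. The function names and toy scores below are hypothetical.

```python
def hit_at_k(scores, gt_index, k=5):
    """Return 1 if the ground-truth candidate is among the k top-scoring ones, else 0.

    scores   : list of similarity scores, one per candidate (higher = better match)
    gt_index : position of the ground-truth candidate in `scores`
    """
    # Sort candidate indices by descending score and keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(gt_index in ranked[:k])


def rank_at_k(all_scores, all_gt, k=5):
    """Average hit rate over all queries (assumed definition of rank@K)."""
    hits = [hit_at_k(s, g, k) for s, g in zip(all_scores, all_gt)]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    # Toy data: two queries, three candidates each; ground truth is candidate 1 in both.
    toy_scores = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]]
    toy_gt = [1, 1]
    print(rank_at_k(toy_scores, toy_gt, k=1))
    print(rank_at_k(toy_scores, toy_gt, k=3))
```

Under this assumed definition, a higher rank@K means the correct image or caption is retrieved within the top K more often.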
