Article

Learning Explainable Disentangled Representations of E-Commerce Data by Aligning Their Visual and Textual Attributes

Journal

COMPUTERS
Volume 11, Issue 12, Pages -

Publisher

MDPI
DOI: 10.3390/computers11120182

Keywords

explainability; disentangled representation; multimodal representation; cross-modal search; outfit recommendation

Funding

  1. KU Leuven Postdoctoral Mandate grant [3E210691]
  2. ERC Advanced Grant CALCULUS H2020 [ERC-2017-ADG 788506]

Abstract

Understanding multimedia content remains a challenging problem in e-commerce search and recommendation applications. It is difficult to obtain item representations that capture the relevant product attributes, since these attributes are fine-grained and scattered across product images with large visual variations and across product descriptions that are noisy and incomplete. In addition, the interpretability and explainability of item representations have become more important in order to make e-commerce applications more intelligible to humans. Multimodal disentangled representation learning, where the independent generative factors of multimodal data are identified and encoded in separate subsets of features in the feature space, is an interesting research area to explore in an e-commerce context, given the benefits of the resulting disentangled representations, such as generalizability, robustness, and interpretability. However, the characteristics of real-world e-commerce data, such as the extensive visual variation, the noisy and incomplete product descriptions, and the complex cross-modal relations between vision and language, together with the lack of an automatic interpretation method to explain the contents of disentangled representations, mean that current approaches to multimodal disentangled representation learning do not suffice for e-commerce data. Therefore, in this work, we design an explainable variational autoencoder framework (E-VAE) which leverages visual and textual item data to obtain disentangled item representations, by jointly learning to disentangle the visual item data and to infer a two-level alignment of the visual and textual item data in a multimodal disentangled space. As such, E-VAE tackles the main challenges in disentangling multimodal e-commerce data. Firstly, with the weak supervision of the two-level alignment, our E-VAE learns to steer the disentanglement process towards discovering the relevant factors of variation in the multimodal data and to ignore the irrelevant visual variations that are abundant in e-commerce data. Secondly, to the best of our knowledge, our E-VAE is the first VAE-based framework with an automatic interpretation mechanism that explains the components of the disentangled item representations with text. With these textual explanations we provide insight into the quality of the disentanglement. Furthermore, we demonstrate that with our explainable disentangled item representations we achieve state-of-the-art outfit recommendation results on the Polyvore Outfits dataset and new state-of-the-art cross-modal search results on the Amazon Dresses dataset.
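For orientation, the following is a minimal PyTorch sketch of the standard pieces a framework like this builds on: a beta-VAE-style objective that pressures the latent posterior towards a factorized prior (encouraging disentanglement), plus one possible image-text alignment term. The class and function names (ImageVAE, alignment_loss), the layer sizes, the beta value, and the single cosine-similarity alignment are all assumptions for illustration; the paper's actual two-level alignment and textual interpretation mechanism are not reproduced here.

```python
# Illustrative sketch only, not the authors' E-VAE: a VAE over image
# features with a beta-weighted KL term and an assumed image-text
# alignment loss that pulls matching pairs together in the latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageVAE(nn.Module):
    """VAE that encodes an image feature vector into a latent code."""
    def __init__(self, in_dim=2048, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)        # posterior mean
        self.logvar = nn.Linear(512, latent_dim)    # posterior log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar, z

def vae_loss(x, x_rec, mu, logvar, beta=4.0):
    # beta > 1 (as in beta-VAE) pushes the posterior towards the
    # factorized standard-normal prior, which encourages disentanglement.
    rec = F.mse_loss(x_rec, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kld

def alignment_loss(z, text_emb, proj):
    # One possible (assumed) alignment term: project text embeddings
    # into the latent space and maximize cosine similarity with the
    # latent code of the matching image.
    t = proj(text_emb)
    return 1.0 - F.cosine_similarity(z, t, dim=-1).mean()

# Usage with random stand-ins for image features and text embeddings.
vae = ImageVAE()
proj = nn.Linear(300, 32)            # text embedding -> latent space
x = torch.randn(8, 2048)             # e.g. CNN image features
text = torch.randn(8, 300)           # e.g. pooled word embeddings
x_rec, mu, logvar, z = vae(x)
loss = vae_loss(x, x_rec, mu, logvar) + alignment_loss(z, text, proj)
loss.backward()
```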

