Article

Learning Explainable Disentangled Representations of E-Commerce Data by Aligning Their Visual and Textual Attributes

Journal

COMPUTERS
Volume 11, Issue 12, Pages -

Publisher

MDPI
DOI: 10.3390/computers11120182

Keywords

explainability; disentangled representation; multimodal representation; cross-modal search; outfit recommendation

Funding

  1. KU Leuven Postdoctoral Mandate grant [3E210691]
  2. ERC Advanced Grant CALCULUS H2020 [ERC-2017-ADG 788506]

Abstract

Understanding multimedia content remains a challenging problem in e-commerce search and recommendation applications. It is difficult to obtain item representations that capture the relevant product attributes, since these attributes are fine-grained and scattered across product images with large visual variations and across product descriptions that are noisy and incomplete. In addition, the interpretability and explainability of item representations have become more important in order to make e-commerce applications more intelligible to humans. Multimodal disentangled representation learning, where the independent generative factors of multimodal data are identified and encoded in separate subsets of features in the feature space, is an interesting research area to explore in an e-commerce context, given the benefits of the resulting disentangled representations, such as generalizability, robustness, and interpretability. However, the characteristics of real-world e-commerce data, such as the extensive visual variation, the noisy and incomplete product descriptions, and the complex cross-modal relations between vision and language, together with the lack of an automatic interpretation method to explain the contents of disentangled representations, mean that current approaches to multimodal disentangled representation learning do not suffice for e-commerce data. Therefore, in this work, we design an explainable variational autoencoder framework (E-VAE) which leverages visual and textual item data to obtain disentangled item representations, by jointly learning to disentangle the visual item data and to infer a two-level alignment of the visual and textual item data in a multimodal disentangled space. As such, E-VAE tackles the main challenges in disentangling multimodal e-commerce data. Firstly, with the weak supervision of the two-level alignment, our E-VAE learns to steer the disentanglement process towards discovering the relevant factors of variation in the multimodal data and to ignore the irrelevant visual variations that are abundant in e-commerce data. Secondly, to the best of our knowledge, our E-VAE is the first VAE-based framework with an automatic interpretation mechanism that explains the components of the disentangled item representations with text. With these textual explanations we provide insight into the quality of the disentanglement. Furthermore, we demonstrate that with our explainable disentangled item representations we achieve state-of-the-art outfit recommendation results on the Polyvore Outfits dataset and new state-of-the-art cross-modal search results on the Amazon Dresses dataset.
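For orientation, the following is a minimal PyTorch sketch of the standard pieces a framework like this builds on: a beta-VAE-style objective that pressures the latent posterior towards a factorized prior (encouraging disentanglement), plus one possible image-text alignment term. The class and function names (ImageVAE, alignment_loss), the layer sizes, the beta value, and the single cosine-similarity alignment are all assumptions for illustration; the paper's actual two-level alignment and textual interpretation mechanism are not reproduced here.

```python
# Illustrative sketch only, not the authors' E-VAE: a VAE over image
# features with a beta-weighted KL term and an assumed image-text
# alignment loss that pulls matching pairs together in the latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageVAE(nn.Module):
    """VAE that encodes an image feature vector into a latent code."""
    def __init__(self, in_dim=2048, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)        # posterior mean
        self.logvar = nn.Linear(512, latent_dim)    # posterior log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar, z

def vae_loss(x, x_rec, mu, logvar, beta=4.0):
    # beta > 1 (as in beta-VAE) pushes the posterior towards the
    # factorized standard-normal prior, which encourages disentanglement.
    rec = F.mse_loss(x_rec, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kld

def alignment_loss(z, text_emb, proj):
    # One possible (assumed) alignment term: project text embeddings
    # into the latent space and maximize cosine similarity with the
    # latent code of the matching image.
    t = proj(text_emb)
    return 1.0 - F.cosine_similarity(z, t, dim=-1).mean()

# Usage with random stand-ins for image features and text embeddings.
vae = ImageVAE()
proj = nn.Linear(300, 32)            # text embedding -> latent space
x = torch.randn(8, 2048)             # e.g. CNN image features
text = torch.randn(8, 300)           # e.g. pooled word embeddings
x_rec, mu, logvar, z = vae(x)
loss = vae_loss(x, x_rec, mu, logvar) + alignment_loss(z, text, proj)
loss.backward()
```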

