Article

Learning Explainable Disentangled Representations of E-Commerce Data by Aligning Their Visual and Textual Attributes

Journal

Computers
Volume 11, Issue 12, Article 182

Publisher

MDPI
DOI: 10.3390/computers11120182

Keywords

explainability; disentangled representation; multimodal representation; cross-modal search; outfit recommendation

Funding

  1. KU Leuven Postdoctoral Mandate grant [3E210691]
  2. ERC Advanced Grant CALCULUS H2020 [ERC-2017-ADG 788506]

Abstract

Understanding multimedia content in e-commerce is challenging, and disentangled representation learning is a promising approach. In this study, an explainable variational autoencoder framework (E-VAE) is proposed to obtain disentangled item representations by jointly learning visual and textual data. With the automatic interpretation mechanism, E-VAE provides insight into the quality of the disentanglement. Experimental results demonstrate the effectiveness of the proposed framework in outfit recommendation and cross-modal search tasks.
Understanding multimedia content remains a challenging problem in e-commerce search and recommendation applications. It is difficult to obtain item representations that capture the relevant product attributes, since these attributes are fine-grained and scattered across product images with huge visual variations and across product descriptions that are noisy and incomplete. In addition, the interpretability and explainability of item representations have become more important in order to make e-commerce applications more intelligible to humans. Multimodal disentangled representation learning, where the independent generative factors of multimodal data are identified and encoded in separate subsets of features in the feature space, is an interesting research area to explore in an e-commerce context, given the benefits of the resulting disentangled representations such as generalizability, robustness, and interpretability. However, the characteristics of real-world e-commerce data, such as the extensive visual variation, noisy and incomplete product descriptions, and complex cross-modal relations between vision and language, together with the lack of an automatic interpretation method to explain the contents of disentangled representations, mean that current approaches to multimodal disentangled representation learning do not suffice for e-commerce data. Therefore, in this work, we design an explainable variational autoencoder framework (E-VAE) which leverages visual and textual item data to obtain disentangled item representations by jointly learning to disentangle the visual item data and to infer a two-level alignment of the visual and textual item data in a multimodal disentangled space. As such, E-VAE tackles the main challenges in disentangling multimodal e-commerce data. Firstly, with the weak supervision of the two-level alignment, our E-VAE learns to steer the disentanglement process towards discovering the relevant factors of variation in the multimodal data and to ignore irrelevant visual variations, which are abundant in e-commerce data. Secondly, to the best of our knowledge, our E-VAE is the first VAE-based framework with an automatic interpretation mechanism that explains the components of the disentangled item representations with text. With our textual explanations we provide insight into the quality of the disentanglement. Furthermore, we demonstrate that with our explainable disentangled item representations we achieve state-of-the-art outfit recommendation results on the Polyvore Outfits dataset and report new state-of-the-art cross-modal search results on the Amazon Dresses dataset.
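
To make the abstract's core ideas concrete, below is a minimal PyTorch sketch of a VAE that disentangles visual item features, aligns projected textual attribute embeddings with the latent code at two levels, and labels latent dimensions with attribute words. All module names, layer sizes, loss weights, and the soft-assignment and interpretation heuristics here are illustrative assumptions for this sketch, not the published E-VAE architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultimodalDisentangledVAE(nn.Module):
        """Sketch of a VAE over visual item features, with a projection
        that maps textual attribute embeddings into the latent space so
        attributes can be matched against individual latent dimensions.
        Layer sizes are illustrative, not the paper's configuration."""

        def __init__(self, img_dim=2048, txt_dim=300, z_dim=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU())
            self.mu = nn.Linear(512, z_dim)
            self.logvar = nn.Linear(512, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU(),
                                     nn.Linear(512, img_dim))
            self.txt_proj = nn.Linear(txt_dim, z_dim)

        def forward(self, img_feat):
            h = self.enc(img_feat)
            mu, logvar = self.mu(h), self.logvar(h)
            # Reparameterization trick: sample z = mu + sigma * eps.
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            return self.dec(z), mu, logvar, z

    def loss_fn(model, img_feat, txt_feat, beta=4.0, gamma=1.0):
        recon, mu, logvar, z = model(img_feat)
        # Standard beta-VAE terms: reconstruction + weighted KL divergence.
        rec = F.mse_loss(recon, img_feat, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        # Representation-level alignment: whole text embedding vs. latent code.
        t = model.txt_proj(txt_feat)
        global_align = 1 - F.cosine_similarity(t, z, dim=-1).mean()
        # Dimension-level alignment (illustrative stand-in): encourage each
        # projected attribute to concentrate on a single latent dimension
        # via a soft assignment over dimensions.
        soft_assign = F.softmax(t.abs(), dim=-1)
        local_align = -(soft_assign * F.log_softmax(z.abs(), dim=-1)).sum(-1).mean()
        return rec + beta * kl + gamma * (global_align + local_align)

    @torch.no_grad()
    def explain_dimensions(model, attr_words, attr_embs):
        """Hypothetical interpretation step: label each latent dimension
        with the attribute whose projection loads most heavily on it."""
        proj = model.txt_proj(attr_embs)   # (num_attrs, z_dim)
        best = proj.abs().argmax(dim=0)    # strongest attribute per dim
        return [attr_words[i] for i in best.tolist()]

    # Usage (random data for illustration):
    model = MultimodalDisentangledVAE()
    img = torch.randn(8, 2048)  # e.g., CNN features of product images
    txt = torch.randn(8, 300)   # e.g., pooled word embeddings of descriptions
    loss = loss_fn(model, img, txt)
    loss.backward()

The two alignment terms mirror the abstract's two-level idea: a global term matches the whole text embedding to the whole latent code, while a per-dimension term nudges each attribute towards a single latent dimension, which is what later makes the dimensions explainable with attribute words.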
