4.7 Article

DPTVAE: Data-driven prior-based tabular variational autoencoder for credit data synthesizing

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 241, Issue -, Pages -

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2023.122071

Keywords

DPTVAE; Data synthesis; Deep learning; Credit data; Privacy protection

The article introduces a method using a data-driven prior-based tabular variational autoencoder (DPTVAE) to synthesize credit data. The DPTVAE effectively addresses the challenges in credit data synthesis and demonstrates exceptional synthesis performance, particularly in identifying real default users based on synthetic data.
Data synthesis is of great significance for the privacy protection of real credit data. Credit data synthesis poses unique challenges: mixed discrete and continuous features, a lack of prior information, high feature complexity, and imbalance. To address these challenges, we propose a data-driven prior-based tabular variational autoencoder (DPTVAE) that synthesizes credit data end to end, without any expert experience. It comprises three main innovations: 1) Binning Gaussian probability density (BGPD)-based feature type classification. Previous work relies on expert-experience classification, which is limited and may be missing. We propose a BGPD-based calculation of class value importance to automatically classify columns as discrete or continuous, so that values or distributions are synthesized appropriately for each column type. 2) Encoding based on BGPD-Variational Gaussian Mixture (BGPD-VGM). Continuous columns of financial data usually involve skewed, multi-peaked, or mixture distributions. To adapt to this distributional complexity, we propose BGPD-VGM to encode a data-driven prior. 3) Conditional decoding. We also design a conditional decoding strategy for DPTVAE to synthesize imbalanced discrete columns. Compared with seven existing advanced models, DPTVAE demonstrates exceptional synthesis performance on two datasets with a 33-fold difference in data size, particularly in identifying real default users based on synthetic data. This achievement is significant for data applications based on privacy protection. The code for this work can be found at https://github.com/jinxtan/DPTVAE.
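
To make the mixture-based encoding idea in innovation 2 more concrete, the sketch below shows generic mode-specific normalization of one continuous value under a Gaussian mixture. It is not the BGPD-VGM procedure from the paper, whose details are not given in the abstract. The mixture parameters, the function names `encode_value` and `decode_value`, the deterministic choice of the most likely mode, and the scaling by four standard deviations (a convention borrowed from CTGAN/TVAE-style encoders) are all assumptions made for illustration.

```python
import math


def log_weighted_density(x, weight, mean, std):
    """Log of weight * N(x | mean, std), up to an additive constant."""
    z = (x - mean) / std
    return math.log(weight) - math.log(std) - 0.5 * z * z


def encode_value(x, weights, means, stds):
    """Encode a scalar as (alpha, one_hot_mode) under a fixed mixture.

    Assumption: the mixture (weights, means, stds) was fitted beforehand,
    e.g. by a variational Gaussian mixture; here it is simply passed in.
    """
    scores = [
        log_weighted_density(x, w, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    # Pick the most likely mode (deterministic, for reproducibility).
    k = max(range(len(scores)), key=scores.__getitem__)
    # Normalize the value relative to the chosen mode.
    alpha = (x - means[k]) / (4.0 * stds[k])
    one_hot = [1 if i == k else 0 for i in range(len(means))]
    return alpha, one_hot


def decode_value(alpha, one_hot, means, stds):
    """Invert encode_value: recover the scalar from (alpha, one_hot_mode)."""
    k = one_hot.index(1)
    return alpha * 4.0 * stds[k] + means[k]


# Hypothetical two-mode mixture for a skewed "loan amount" column.
WEIGHTS = [0.7, 0.3]
MEANS = [1000.0, 20000.0]
STDS = [300.0, 5000.0]

small = encode_value(1300.0, WEIGHTS, MEANS, STDS)
large = encode_value(25000.0, WEIGHTS, MEANS, STDS)
```

Each value is represented by a one-hot mode indicator plus a scalar offset within that mode, so a skewed or multi-peaked column becomes a set of roughly unit-scale inputs, and `decode_value` maps a generated pair back to the original scale.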
