Article

From Missing Data Imputation to Data Generation

Journal

JOURNAL OF COMPUTATIONAL SCIENCE
Volume 61

Publisher

ELSEVIER
DOI: 10.1016/j.jocs.2022.101640

Keywords

Tabular Data; Missing Data; Data Imputation; Data Generation; Generative Adversarial Networks (GANs)

Funding

  1. Charite-Universitaetsmedizin Berlin, Germany
  2. Berlin Institute of Health, Germany
  3. German Research Foundation (DFG)
  4. FCT - Fundacao para a Ciencia e Tecnologia, Portugal [UIDB/00319/2020]

This paper introduces three novel data imputation methods based on Generative Adversarial Networks (GANs) and studies how such methods can be used to generate fully synthetic datasets. Synthetic data generation can help mitigate legal, ethical, and data privacy issues, and can augment the original data. In the experimental evaluation, the new methods are on par with or outperform state-of-the-art imputation methods in response time and in the quality of the imputed data.
Real datasets often lack values, compromising the quality of data analyses. Adequate data may be synthetically imputed to replace missing values - a technique known as missing data imputation - avoiding deletion of incomplete observations. Several data imputation methods have been proposed, and generative methods based on Artificial Neural Networks (ANN) are successful alternatives to discriminative methods. In this extended version of our work presented at the International Conference on Computational Science (Neves et al., 2021), we propose three novel data imputation methods based on Generative Adversarial Networks (GAN): SGAIN, WSGAIN-CP, and WSGAIN-GP.
We further studied how data imputation methods can be used to generate fully synthetic datasets. Among other benefits, the generation of synthetic data can help to mitigate legal, ethical, and data privacy issues, as well as to augment original data. In this context, we introduce tabulator, a novel meta-method for synthetic data generation that uses the data imputation methods as back-end engines for tabular data generation.
We evaluated our data imputation methods using datasets with different amputation rates following the Missing Completely At Random (MCAR) setting. The results show that our methods are on par with or outperform state-of-the-art imputation methods in terms of response time and the quality of imputed data. We further evaluated and compared our data generation methods, which were derived from tabulator, with a state-of-the-art approach, the Conditional Tabular GAN (CTGAN). The evaluation results show that our tabulator methods outperform CTGAN in many cases, for example regarding the accuracy of machine learning tasks (e.g., prediction or classification) performed on the synthetic output data.
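The MCAR evaluation setting mentioned in the abstract can be illustrated with a small sketch. It is not the authors' code: the function names (`ampute_mcar`, `impute_column_mean`, `rmse_on_missing`), the toy Gaussian data, and the amputation rates 0.2, 0.5, and 0.8 are all assumptions chosen for illustration, and column-mean imputation is only a stand-in for a real imputer such as SGAIN. The sketch removes cells independently at random, imputes them, and scores the result only on the removed cells.

```python
import math
import random


def ampute_mcar(rows, rate, rng):
    """Return a copy of rows where each cell is None with probability `rate`.

    Under MCAR the chance that a cell is missing does not depend on any value,
    observed or unobserved, so one independent draw per cell is enough.
    """
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_mean(rows):
    """Placeholder imputer (assumption): fill gaps with the column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(original, amputed, imputed):
    """Root mean squared error over the cells that were removed."""
    errors = [
        (o - i) ** 2
        for orow, arow, irow in zip(original, amputed, imputed)
        for o, a, i in zip(orow, arow, irow)
        if a is None
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


rng = random.Random(0)
# Toy complete dataset (assumption): 500 rows, 4 standard-normal columns.
data = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(500)]

for rate in (0.2, 0.5, 0.8):  # hypothetical amputation rates
    amputed = ampute_mcar(data, rate, rng)
    imputed = impute_column_mean(amputed)
    print(f"rate={rate:.1f} rmse={rmse_on_missing(data, amputed, imputed):.3f}")
```

Swapping `impute_column_mean` for another imputer with the same input and output shape would allow a like-for-like comparison at each amputation rate, which is the kind of comparison the abstract describes.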
