Article

From Missing Data Imputation to Data Generation

JOURNAL OF COMPUTATIONAL SCIENCE (2022)

Journal

JOURNAL OF COMPUTATIONAL SCIENCE

Volume 61, Issue -, Pages -

Publisher

ELSEVIER

DOI: 10.1016/j.jocs.2022.101640

Keywords

Tabular Data; Missing Data; Data Imputation; Data Generation; Generative Adversarial Networks (GANs)

Categories

Computer Science, Interdisciplinary Applications; Computer Science, Theory & Methods

Funding

Charite-Universitaetsmedizin Berlin, Germany
Berlin Institute of Health, Germany
German Research Foundation (DFG)
FCT - Fundacao para a Ciencia e Tecnologia, Portugal [UIDB/00319/2020]

Summary

This paper introduces three novel data imputation methods based on Generative Adversarial Networks (GANs) and investigates how data imputation methods can be used to generate synthetic datasets. Synthetic data generation can help mitigate legal, ethical, and data privacy issues and can augment the original data. In the experimental evaluation, the new methods perform on par with or better than state-of-the-art imputation methods.

Abstract

Real datasets often lack values, compromising the quality of data analyses. Adequate data may be synthetically imputed to replace missing values, a technique known as missing data imputation, which avoids deleting incomplete observations. Several data imputation methods have been proposed, and generative methods based on Artificial Neural Networks (ANNs) are successful alternatives to discriminative methods. In this extended version of our work presented at the International Conference on Computational Science (Neves et al., 2021), we propose three novel data imputation methods based on Generative Adversarial Networks (GANs): SGAIN, WSGAIN-CP, and WSGAIN-GP.

We further studied how data imputation methods can be used to generate fully synthetic datasets. Among other benefits, the generation of synthetic data can help to mitigate legal, ethical, and data privacy issues, as well as to augment original data. In this context, we introduce tabulator, a novel meta-method for synthetic data generation that uses the data imputation methods as back-end engines for tabular data generation.

We evaluated our data imputation methods using datasets with different amputation rates following the Missing Completely At Random (MCAR) setting. The results show that our methods are on par with or outperform state-of-the-art imputation methods in terms of response time and the quality of imputed data. We further evaluated and compared our data generation methods, which were derived from tabulator, with a state-of-the-art approach, the Conditional Tabular GAN (CTGAN). The evaluation results show that our tabulator methods outperform CTGAN in many cases, for example regarding the accuracy of machine learning tasks (e.g., prediction or classification) performed on the synthetic output data.
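
The MCAR evaluation setting mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not implement SGAIN, WSGAIN-CP, or WSGAIN-GP. It assumes purely numeric data, uses column-mean imputation as a stand-in baseline, and scores the result with RMSE over the removed entries. The function names (`ampute_mcar`, `impute_column_mean`, `rmse_on_missing`), the toy Gaussian data, and the rates 0.2, 0.5, and 0.8 are hypothetical choices made for illustration only.

```python
import math
import random


def ampute_mcar(rows, rate, rng):
    """Copy `rows`, replacing each value with None independently with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_mean(rows):
    """Fill each None with the mean of the observed values in its column (baseline only)."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        # Arbitrary fallback of 0.0 if a column ends up with no observed values.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(original, amputed, imputed):
    """RMSE between original and imputed values, restricted to the removed entries."""
    squared = [
        (o - i) ** 2
        for orow, arow, irow in zip(original, amputed, imputed)
        for o, a, i in zip(orow, arow, irow)
        if a is None
    ]
    return math.sqrt(sum(squared) / len(squared)) if squared else 0.0


rng = random.Random(0)
# Hypothetical toy table: 1000 rows, 5 numeric columns.
data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(1000)]

for rate in (0.2, 0.5, 0.8):  # illustrative amputation rates
    amputed = ampute_mcar(data, rate, rng)
    imputed = impute_column_mean(amputed)
    print(f"rate={rate:.1f}  RMSE={rmse_on_missing(data, amputed, imputed):.3f}")
```

Because every entry is removed with the same probability regardless of its value or of any other column, the missingness is completely at random. Any other imputer could be substituted for `impute_column_mean` and scored the same way.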
