☆ 4.6 Article

Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

APPLIED SCIENCES-BASEL (2021)

期刊

APPLIED SCIENCES-BASEL

卷 11, 期 5, 页码 -

出版社

MDPI

DOI: 10.3390/app11052158

关键词

generative models; synthetic data; synthetic data utility; data privacy

类别

Chemistry, Multidisciplinary Engineering, Multidisciplinary Materials Science, Multidisciplinary Physics, Applied

资金

UAE University research grant UPAR [31T121]

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

Synthetic data provides a privacy-protecting mechanism for healthcare data, generating artificial datasets without identifiable information for safe sharing. The paper evaluates the impact of different synthetic data generation and usage settings on the utility of the data and models, aiming to provide insights into the best practices when working with synthetic data.

Synthetic data provides a privacy protecting mechanism for the broad usage and sharing of healthcare data for secondary purposes. It is considered a safe approach for the sharing of sensitive data as it generates an artificial dataset that contains no identifiable information. Synthetic data is increasing in popularity with multiple synthetic data generators developed in the past decade, yet its utility is still a subject of research. This paper is concerned with evaluating the effect of various synthetic data generation and usage settings on the utility of the generated synthetic data and its derived models. Specifically, we investigate (i) the effect of data pre-processing on the utility of the synthetic data generated, (ii) whether tuning should be applied to the synthetic datasets when generating supervised machine learning models, and (iii) whether sharing preliminary machine learning results can improve the synthetic data models. Lastly, (iv) we investigate whether one utility measure (Propensity score) can predict the accuracy of the machine learning models generated from the synthetic data when employed in real life. We use two popular measures of synthetic data utility, propensity score and classification accuracy, to compare the different settings. We adopt a recent mechanism for the calculation of propensity, which looks carefully into the choice of model for the propensity score calculation. Accordingly, this paper takes a new direction with investigating the effect of various data generation and usage settings on the quality of the generated data and its ensuing models. The goal is to inform on the best strategies to follow when generating and using synthetic data.

Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

期刊

APPLIED SCIENCES-BASEL

出版社

MDPI

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

期刊

APPLIED SCIENCES-BASEL

出版社

MDPI

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文