Article

DataSifter II: Partially synthetic data sharing of sensitive information containing time-varying correlated observations

JOURNAL OF ALGORITHMS & COMPUTATIONAL TECHNOLOGY (2022)

Journal

JOURNAL OF ALGORITHMS & COMPUTATIONAL TECHNOLOGY

Volume 16, Issue -, Pages -

Publisher

SAGE PUBLICATIONS LTD

DOI: 10.1177/17483026211065379

Keywords

Data-sharing; electronic health records; longitudinal imputation; synthetic data generation; time-varying data; DataSifter

Category

Computer Science, Interdisciplinary Applications

Funding

National Science Foundation (NSF) [1916425, 1734853, 1636840, 1416953, 0716055, 1023115]
National Institutes of Health (NIH) [U54 EB020406, P50 NS091856, P30 DK089503, UL1TR002240, R01CA233487, R01MH121079, R01MH126137, T32GM141746, P20 NR015331]
Directorate for Computer & Information Science & Engineering [1636840] Funding Source: National Science Foundation
Directorate for Computer & Information Science & Engineering, Office of Advanced Cyberinfrastructure (OAC) [1916425] Funding Source: National Science Foundation
Office of Advanced Cyberinfrastructure (OAC) [1636840] Funding Source: National Science Foundation

Summary

This study presents a partially synthetic data generation technique for creating anonymized data archives that closely resemble the original sensitive data. This technique reduces the risk of re-identification while preserving the analytical value of the obfuscated data. It provides an automated tool for effective and collaborative analytics for large time-varying datasets containing sensitive information.

Abstract

There is significant public demand for rapid, data-driven scientific investigations using aggregated sensitive information. However, many technical challenges and regulatory policies hinder efficient data sharing. In this study, we describe a partially synthetic data generation technique for creating anonymized data archives whose joint distributions closely resemble those of the original (sensitive) data. Specifically, we introduce the DataSifter technique for time-varying correlated data (DataSifter II), which relies on iterative model-based imputation using generalized linear mixed models and random effects-expectation maximization trees. DataSifter II can be used to generate synthetic repeated-measures data for testing and validating new analytical techniques. Applied to simulated and real clinical data, DataSifter II substantially reduces re-identification risk (data privacy) relative to the multiple imputation method while preserving the analytical value (data utility) of the obfuscated data. In a simulation with 20% artificial missingness, DataSifter II achieves at least an 80% reduction in disclosure risk compared to the multiple imputation method, without a substantial impact on the analytical value of the data. In a separate validation on clinical data (Medical Information Mart for Intensive Care III), model-based statistical inference drawn from the original data agrees with the analogous inference obtained from the DataSifter II obfuscated (sifted) data. For large time-varying datasets containing sensitive information, the proposed technique provides an automated tool for lowering the barriers to data sharing and facilitating effective, advanced, and collaborative analytics.
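
The general idea of partial synthesis by model-based imputation can be sketched in a few lines of Python. This is not the DataSifter II algorithm: it is a single-pass toy that assumes a random-intercept linear model (the simplest special case of a mixed model) fitted by moment estimates, where the abstract describes an iterative procedure with generalized linear mixed models and random effects-expectation maximization trees. The function name `sift_random_intercept`, its parameters, and the record layout `(subject, time, value)` are hypothetical and chosen only for illustration.

```python
import random
import statistics
from collections import defaultdict


def sift_random_intercept(records, missing_frac=0.2, seed=0):
    """Partially synthesize records = [(subject, time, value), ...].

    Illustrative assumption, not the published method: mask a fraction of
    the values, then refill them from a random-intercept model
    y_ij = mu + b_i + e_ij fitted to the values that stay observed.
    """
    rng = random.Random(seed)
    n_mask = round(missing_frac * len(records))
    masked = set(rng.sample(range(len(records)), n_mask))

    # Group the values that remain observed by subject.
    observed = defaultdict(list)
    for idx, (subject, _time, value) in enumerate(records):
        if idx not in masked:
            observed[subject].append(value)

    all_obs = [v for vals in observed.values() for v in vals]
    mu = statistics.fmean(all_obs)

    # Moment estimates of within-subject and between-subject variance.
    within = [statistics.variance(v) for v in observed.values() if len(v) > 1]
    sigma2_e = statistics.fmean(within) if within else 0.0
    means = {s: statistics.fmean(v) for s, v in observed.items()}
    mean_n = statistics.fmean([len(v) for v in observed.values()])
    var_means = statistics.variance(list(means.values())) if len(means) > 1 else 0.0
    sigma2_b = max(var_means - sigma2_e / mean_n, 0.0)

    sifted = []
    for idx, (subject, time, value) in enumerate(records):
        if idx in masked:
            n_i = len(observed.get(subject, []))
            denom = sigma2_b + (sigma2_e / n_i if n_i else 0.0)
            # Shrink the subject mean toward the grand mean (BLUP-style).
            shrink = sigma2_b / denom if n_i and denom > 0 else 0.0
            b_i = shrink * (means.get(subject, mu) - mu)
            # Draw the replacement from the fitted model, with residual noise.
            value = mu + b_i + rng.gauss(0.0, sigma2_e ** 0.5)
        sifted.append((subject, time, value))
    return sifted, masked


# Toy longitudinal data: 20 subjects, 5 visits each.
_rng = random.Random(1)
toy = []
for s in range(20):
    b = _rng.gauss(0, 2)  # subject-level random intercept
    for t in range(5):
        toy.append((s, t, 10 + b + _rng.gauss(0, 1)))

sifted_toy, masked_toy = sift_random_intercept(toy, missing_frac=0.2)
print(len(masked_toy), "of", len(toy), "values replaced")
```

Only the masked cells are replaced, so the output is partially synthetic: subject identifiers, time points, and the unmasked values pass through unchanged. Assessing disclosure risk and utility, as reported in the abstract, would need measures that this sketch does not attempt.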
