Article

DataSifterText: Partially Synthetic Text Generation for Sensitive Clinical Notes

JOURNAL OF MEDICAL SYSTEMS (2022)

Journal

JOURNAL OF MEDICAL SYSTEMS

Volume 46, Issue 12, Pages -

Publisher

SPRINGER

DOI: 10.1007/s10916-022-01880-6

Keywords

PHI; Data science; Clinical notes; AI; ML; Synthetic data

Categories

Health Care Sciences & Services; Medical Informatics

Funding

National Science Foundation [1916425, 1734853, 1636840, 1416953, 0716055, 1023115]
National Institutes of Health
Michigan Institute for Data Science
National Institutes of Health [P20 NR015331, U54 EB020406, P50 NS091856, P30 DK089503, UL1 TR002240, R01 CA233487, R01 MH121079, R01 MH126137, T32 GM141746]
Directorate for Computer & Information Science & Engineering [1916425, 1636840] Funding Source: National Science Foundation
Directorate for Education and Human Resources [1023115, 1416953] Funding Source: National Science Foundation
Directorate for Social, Behavioral & Economic Sciences
Division of Behavioral and Cognitive Sciences [1734853] Funding Source: National Science Foundation
Division of Undergraduate Education [1416953, 1023115] Funding Source: National Science Foundation
Office of Advanced Cyberinfrastructure (OAC) [1636840, 1916425] Funding Source: National Science Foundation

Summary

This article introduces DataSifterText, a method that generates partially synthetic clinical free text to protect privacy while preserving utility. In two case studies, it preserved more information than naive content suppression while keeping re-identification risk low.

Abstract

Petabytes of health data are collected annually across the globe in electronic health records (EHR), including significant information stored as unstructured free text. However, the lack of effective mechanisms to securely share clinical text has inhibited its full utilization. We propose a new method, DataSifterText, to generate partially synthetic clinical free-text that can be safely shared between stakeholders (e.g., clinicians, STEM researchers, engineers, analysts, and healthcare providers), limiting the re-identification risk while providing significantly better utility preservation than suppressing or generalizing sensitive tokens. The method creates partially synthetic free-text data, which inherits the joint population distribution of the original data, and disguises the location of true and obfuscated words. Under certain obfuscation levels, the resulting synthetic text was sufficiently altered with different choices, orders, and frequencies of words compared to the original records. The differences were comparable to machine-generated (fully synthetic) text reported in previous studies. We applied DataSifterText to two medical case studies. In the CDC work injury application, using privacy protection, 60.9-86.5% of the synthetic descriptions belong to the same cluster as the original descriptions, demonstrating better utility preservation than the naive content suppressing method (45.8-85.7%). In the MIMIC III application, the generated synthetic data maintained over 80% of the original information regarding patients' overall health conditions. The reported DataSifterText statistical obfuscation results indicate that the technique provides sufficient privacy protection (low identification risk) while preserving population-level information (high utility).
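
The cluster-based utility figures quoted in the abstract (the share of synthetic descriptions that belong to the same cluster as their originals) can be illustrated with a minimal sketch. It assumes each original record and its synthetic counterpart have already been assigned a cluster label by some clustering step; the abstract does not describe that step, so the function name `same_cluster_rate` and the label values below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def same_cluster_rate(original_labels: Sequence[int],
                      synthetic_labels: Sequence[int]) -> float:
    """Fraction of records whose synthetic version lands in the same
    cluster as the original record (labels aligned record by record)."""
    if len(original_labels) != len(synthetic_labels):
        raise ValueError("label sequences must be aligned record by record")
    if len(original_labels) == 0:
        raise ValueError("need at least one record")
    matches = sum(o == s for o, s in zip(original_labels, synthetic_labels))
    return matches / len(original_labels)


# Hypothetical cluster assignments for five records (illustrative only,
# not data from the paper).
original = [0, 1, 1, 2, 0]
synthetic = [0, 1, 2, 2, 0]
rate = same_cluster_rate(original, synthetic)
print(f"same-cluster rate: {rate:.1%}")
```

A higher rate under this kind of measure would suggest that the obfuscated text keeps more of the population-level structure of the original; it says nothing by itself about re-identification risk, which would have to be assessed separately.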
