Article

DataSifterText: Partially Synthetic Text Generation for Sensitive Clinical Notes

Journal

JOURNAL OF MEDICAL SYSTEMS
Volume 46, Issue 12, Pages -

Publisher

SPRINGER
DOI: 10.1007/s10916-022-01880-6

Keywords

PHI; Data science; Clinical notes; AI; ML; Synthetic data

Funding

  1. National Science Foundation [1916425, 1734853, 1636840, 1416953, 0716055, 1023115]
  2. National Institutes of Health
  3. Michigan Institute for Data Science
  4. National Institutes of Health [P20 NR015331, U54 EB020406, P50 NS091856, P30 DK089503, UL1 TR002240, R01 CA233487, R01 MH121079, R01 MH126137, T32 GM141746]
  5. Directorate for Computer and Information Science and Engineering [1916425, 1636840] Funding Source: National Science Foundation
  6. Directorate for Education and Human Resources [1023115, 1416953] Funding Source: National Science Foundation
  7. Directorate for Social, Behavioral and Economic Sciences
  8. Division of Behavioral and Cognitive Sciences [1734853] Funding Source: National Science Foundation
  9. Division of Undergraduate Education [1416953, 1023115] Funding Source: National Science Foundation
  10. Office of Advanced Cyberinfrastructure (OAC) [1636840, 1916425] Funding Source: National Science Foundation

This article introduces DataSifterText, a new method that generates partially synthetic clinical free text, preserving utility while protecting privacy. Experiments show that it outperforms traditional content suppression methods in both privacy protection and information preservation.
Petabytes of health data are collected annually across the globe in electronic health records (EHR), including significant information stored as unstructured free text. However, the lack of effective mechanisms to securely share clinical text has inhibited its full utilization. We propose a new method, DataSifterText, to generate partially synthetic clinical free text that can be safely shared between stakeholders (e.g., clinicians, STEM researchers, engineers, analysts, and healthcare providers), limiting the re-identification risk while providing significantly better utility preservation than suppressing or generalizing sensitive tokens. The method creates partially synthetic free-text data that inherits the joint population distribution of the original data and disguises the location of true and obfuscated words. Under certain obfuscation levels, the resulting synthetic text was sufficiently altered, with different choices, orders, and frequencies of words compared to the original records. The differences were comparable to machine-generated (fully synthetic) text reported in previous studies. We applied DataSifterText to two medical case studies. In the CDC work injury application, with privacy protection applied, 60.9-86.5% of the synthetic descriptions belong to the same cluster as the original descriptions, demonstrating better utility preservation than the naive content suppression method (45.8-85.7%). In the MIMIC III application, the generated synthetic data maintained over 80% of the original information regarding patients' overall health conditions. The reported DataSifterText statistical obfuscation results indicate that the technique provides sufficient privacy protection (low identification risk) while preserving population-level information (high utility).
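
The cluster-based utility figure quoted in the abstract (the share of synthetic descriptions that land in the same cluster as their originals) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `same_cluster_rate`, the toy label lists, and the assumption that each record already has one cluster label for its original text and one for its synthetic text are all hypothetical, chosen only to show how such a percentage might be computed.

```python
def same_cluster_rate(original_labels, synthetic_labels):
    """Fraction of records whose synthetic text is assigned to the
    same cluster as the corresponding original text.

    Hypothetical helper: assumes both lists are aligned record by record.
    """
    if len(original_labels) != len(synthetic_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(o == s for o, s in zip(original_labels, synthetic_labels))
    return matches / len(original_labels)


# Made-up cluster assignments for five injury descriptions (illustrative only).
original = [0, 1, 1, 2, 0]
synthetic = [0, 1, 2, 2, 0]

rate = same_cluster_rate(original, synthetic)
print(f"{rate:.1%}")  # 4 of 5 records keep their cluster
```

Under these assumptions, a higher rate means the obfuscated text retains more of the grouping structure of the original records, which is the sense in which the abstract compares DataSifterText with naive content suppression.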
