4.6 Article

DataSifterText: Partially Synthetic Text Generation for Sensitive Clinical Notes

Journal

JOURNAL OF MEDICAL SYSTEMS
Volume 46, Issue 12

Publisher

SPRINGER
DOI: 10.1007/s10916-022-01880-6

Keywords

PHI; Data science; Clinical notes; AI; ML; Synthetic data

Funding

  1. National Science Foundation [1916425, 1734853, 1636840, 1416953, 0716055, 1023115]
  2. National Institutes of Health
  3. Michigan Institute for Data Science
  4. National Institutes of Health [P20 NR015331, U54 EB020406, P50 NS091856, P30 DK089503, UL1 TR002240, R01 CA233487, R01 MH121079, R01 MH126137, T32 GM141746]
  5. Direct For Computer & Info Scie & Enginr [1916425, 1636840] Funding Source: National Science Foundation
  6. Direct For Education and Human Resources [1023115, 1416953] Funding Source: National Science Foundation
  7. Direct For Social, Behav & Economic Scie
  8. Division Of Behavioral and Cognitive Sci [1734853] Funding Source: National Science Foundation
  9. Division Of Undergraduate Education [1416953, 1023115] Funding Source: National Science Foundation
  10. Office of Advanced Cyberinfrastructure (OAC) [1636840, 1916425] Funding Source: National Science Foundation


This article introduces DataSifterText, a method that generates partially synthetic clinical free text, protecting privacy while preserving utility. In two case studies, it preserved more information than naive content suppression while keeping re-identification risk low.

Petabytes of health data are collected annually across the globe in electronic health records (EHR), including significant information stored as unstructured free text. However, the lack of effective mechanisms to securely share clinical text has inhibited its full utilization. We propose a new method, DataSifterText, to generate partially synthetic clinical free text that can be safely shared between stakeholders (e.g., clinicians, STEM researchers, engineers, analysts, and healthcare providers), limiting the re-identification risk while providing significantly better utility preservation than suppressing or generalizing sensitive tokens. The method creates partially synthetic free-text data, which inherits the joint population distribution of the original data and disguises the location of true and obfuscated words. Under certain obfuscation levels, the resulting synthetic text was sufficiently altered, with different word choices, orders, and frequencies than the original records. The differences were comparable to those of machine-generated (fully synthetic) text reported in previous studies. We applied DataSifterText to two medical case studies. In the CDC work injury application, with privacy protection applied, 60.9-86.5% of the synthetic descriptions belonged to the same cluster as the original descriptions, demonstrating better utility preservation than the naive content suppression method (45.8-85.7%). In the MIMIC-III application, the generated synthetic data maintained over 80% of the original information regarding patients' overall health conditions. The reported DataSifterText statistical obfuscation results indicate that the technique provides sufficient privacy protection (low identification risk) while preserving population-level information (high utility).
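To make the contrast between suppression and partial synthesis concrete, here is a minimal toy sketch in Python. It is not the DataSifterText algorithm: the function names, the `level` parameter, and the unigram-frequency sampling are all assumptions made for illustration. In particular, it only draws replacements from marginal word frequencies, whereas the abstract describes inheriting the joint population distribution. The point it illustrates is narrower: a visible mask reveals which positions were altered, while in-vocabulary substitutions do not.

```python
import random
from collections import Counter


def suppress(tokens, sensitive, mask="[REDACTED]"):
    """Naive baseline: replace each sensitive token with a visible mask."""
    return [mask if t in sensitive else t for t in tokens]


def partially_synthesize(tokens, sensitive, corpus_counts, level, rng):
    """Toy obfuscation (hypothetical, not the published method).

    Every sensitive token is swapped, and each remaining token is swapped
    with probability `level`. Replacements are drawn from non-sensitive
    corpus words in proportion to their frequency, so altered positions
    are not marked in the output.
    """
    vocab = [w for w in corpus_counts if w not in sensitive]
    weights = [corpus_counts[w] for w in vocab]
    out = []
    for t in tokens:
        if t in sensitive or rng.random() < level:
            out.append(rng.choices(vocab, weights=weights, k=1)[0])
        else:
            out.append(t)
    return out


# Illustrative data only; these notes are invented for the example.
corpus = [
    "worker fell from ladder and fractured wrist",
    "worker cut hand on saw",
    "smith fell from scaffold and injured back",
]
corpus_counts = Counter(word for note in corpus for word in note.split())
sensitive = {"smith"}
tokens = corpus[2].split()

rng = random.Random(0)  # seeded so repeated runs give the same draw
print(suppress(tokens, sensitive))
print(partially_synthesize(tokens, sensitive, corpus_counts, 0.2, rng))
```

Raising `level` alters more of the non-sensitive words, which trades utility for privacy in the same direction the abstract describes for its obfuscation levels.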
