Article

Optimal vocabulary selection approaches for privacy-preserving deep NLP model training for information extraction and cancer epidemiology

CANCER BIOMARKERS (2022)

Journal

CANCER BIOMARKERS

Volume 33, Issue 2, Pages 185-198

Publisher

IOS PRESS

DOI: 10.3233/CBM-210306

Keywords

Privacy; privacy-preserving training; deep learning; natural language processing; cancer epidemiology; artificial intelligence

Category

Oncology

Funding

National Nuclear Security Administration [17-SC-20-SC]
US Department of Energy (DOE) Office of Science [17-SC-20-SC]
Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program by DOE
NCI of the National Institutes of Health
DOE [DE-AC02-06-CH11357, DE-AC52-07NA27344, DE-AC52-06NA25396, DE-AC05-00OR22725]
Laboratory Directed Research and Development (LDRD) program of Oak Ridge National Laboratory, under LDRD project [9831]
California Department of Public Health [103885]
Centers for Disease Control and Prevention's (CDC) National Program of Cancer Registries [5NU58DP006344]
National Cancer Institute's Surveillance, Epidemiology and End Results Program [HHSN261201800032I, HHSN261201800015I, HHSN261201800009I]
NCI Surveillance, Epidemiology and End Results (SEER) Program [HHSN261201800013I]
CDC National Program of Cancer Registries (NPCR) [U58DP00003907]
Commonwealth of Kentucky
NCI [HHSN261201800007I, HHSN261201300021I]
Surveillance, Epidemiology and End Results (SEER) Program [HHSN261201800007I, HHSN261201300021I]
CDC's National Program of Cancer Registries (NPCR) [NU58DP006332-02-00, NU58DP006279-02-00]
State of Louisiana
State of New Jersey
Rutgers Cancer Institute of New Jersey
National Cancer Institute's Surveillance, Epidemiology and End Results (SEER) Program [HHSN261201800014I, HHSN26100001]
National Cancer Institute's SEER Program [HHSN261291800004I, HHSN261201800016I]
Fred Hutchinson Cancer Research Center
US Centers for Disease Control and Prevention's National Program of Cancer Registries [NU58DP0063 200]
University of Utah
Huntsman Cancer Foundation
DOE Office of Science [DE-AC05-00OR22725]
US Department of Energy (DOE) [DE-AC05-00OR22725]
DOE

Summary

As AI and machine learning advance in biomedical informatics, data security and privacy have become critical concerns. This study aims to quantify the privacy vulnerability of deep learning models for information extraction from medical text and to propose ways to secure patients' information. Results show that the proposed vocabulary selection methods reduce privacy vulnerability while maintaining clinical task performance.

Abstract

BACKGROUND: With the use of artificial intelligence and machine learning techniques in biomedical informatics, security and privacy concerns over the data and subject identities have become an important issue and an essential research topic. Without intentional safeguards, machine learning models may improve task performance by finding patterns and features that are associated with private personal information.

OBJECTIVE: The privacy vulnerability of deep learning models for information extraction from medical text needs to be quantified, since the models are exposed to private health information and personally identifiable information. The objective of the study is to quantify the privacy vulnerability of deep learning models for natural language processing and to explore a proper way of securing patients' information to mitigate confidentiality breaches.

METHODS: The target model is a multitask convolutional neural network for information extraction from cancer pathology reports, trained on data from multiple state population-based cancer registries. This study proposes two schemes for collecting vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments.

RESULTS: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
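The two vocabulary selection schemes named in METHODS can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the `min_registries` and `top_k` parameters, and the toy reports are all assumptions made for illustration. The abstract also does not say which variables the mutual information is computed between; here it is assumed to be between a word's presence in a report and a single task label.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (registry, report text, task label). Not real data.
reports = [
    ("registry_A", "invasive ductal carcinoma breast", "breast"),
    ("registry_A", "adenocarcinoma lung smithville", "lung"),
    ("registry_B", "ductal carcinoma breast", "breast"),
    ("registry_B", "adenocarcinoma lung biopsy", "lung"),
]


def registry_vocabulary(reports, min_registries=2):
    """Scheme (a): keep words seen in at least `min_registries` registries."""
    seen = defaultdict(set)  # word -> registries whose reports contain it
    for registry, text, _label in reports:
        for word in set(text.lower().split()):
            seen[word].add(registry)
    return {w for w, regs in seen.items() if len(regs) >= min_registries}


def mutual_information(reports):
    """MI in nats between each word's presence in a report and the label."""
    n = len(reports)
    label_counts = Counter(label for _, _, label in reports)
    joint = defaultdict(Counter)  # word -> label -> reports containing word
    for _, text, label in reports:
        for word in set(text.lower().split()):
            joint[word][label] += 1
    scores = {}
    for word, by_label in joint.items():
        n_word = sum(by_label.values())
        mi = 0.0
        for label, n_label in label_counts.items():
            cells = ((n_word, by_label[label]),
                     (n - n_word, n_label - by_label[label]))
            for n_presence, n_joint in cells:  # word present, then absent
                if n_joint:  # empty cells contribute nothing to the sum
                    mi += (n_joint / n) * math.log(
                        n_joint * n / (n_presence * n_label))
        scores[word] = mi
    return scores


def mi_vocabulary(reports, top_k):
    """Scheme (b): keep the `top_k` words with the highest mutual information."""
    scores = mutual_information(reports)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:top_k])


print(sorted(registry_vocabulary(reports)))
print(sorted(mi_vocabulary(reports, top_k=5)))
```

In this toy corpus, "smithville" (a stand-in for a potentially identifying token) occurs in only one registry and carries little information about the label, so both schemes drop it. That is the intuition behind restricting the vocabulary: rare, source-specific tokens are the ones most likely to leak private information while contributing little to the clinical task.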
