4.6 Article

Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients

Journal

APPLIED SCIENCES-BASEL
Volume 12, Issue 19, Pages -

Publisher

MDPI
DOI: 10.3390/app12199976

Keywords

protected health information; natural language processing (NLP); named entity recognition (NER); de-identification; conditional random field (CRF)

Funding

  1. Roche Diagnostics Information Solutions


The de-identification of clinical reports is crucial for protecting patient confidentiality. This study explores the utility of various features in a conditional-random-field-based named entity recognition (NER) model by annotating a large volume of radiology reports and building NER models. The results indicate that n-gram, prefix-suffix, word embedding, and word shape are the best-performing features.
The de-identification of clinical reports is essential to protect the confidentiality of patients. Named entity recognition (NER) models based on natural language processing (NLP) are widely used for automatic clinical de-identification. The performance of such a machine learning model relies largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. NLP toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports, which were divided into seven types. Multiple features were extracted by the toolkit, and NER models were built using these features and their combinations. A total of 10 features were extracted, and the performance of the models was evaluated based on their precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed the others across all types of reports. The dataset was large in volume and divided into multiple types of reports. This diversity ensured that the results did not depend on a small number of structured texts from which a machine learning model could easily learn the features. The manual de-identification of large-scale clinical reports is impractical. This study helps to identify, from the wide array of features mentioned in the literature, the best-performing ones for building an NER model for automatic de-identification.
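
To make the feature names in the abstract concrete, the following is a minimal Python sketch of how word shape, prefix-suffix, and n-gram (neighboring-token) features might be computed per token before being passed to a CRF, together with the precision, recall, and F1 calculation. It is an illustration under stated assumptions, not the study's implementation: the function names, the affix length of 3, the `<BOS>`/`<EOS>` padding markers, and the exact shape mapping are all hypothetical, and word embeddings are omitted because they require a pretrained model.

```python
import re


def word_shape(token: str) -> str:
    """Abstract a token's characters: uppercase -> X, lowercase -> x, digit -> d.

    The mapping is an assumed convention; toolkits differ in the details.
    """
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens: list[str], i: int, affix_len: int = 3) -> dict[str, str]:
    """Build a feature dictionary for tokens[i] (hypothetical feature names)."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        # Prefix-suffix features: leading and trailing characters.
        "prefix": tok[:affix_len],
        "suffix": tok[-affix_len:],
        # Neighboring tokens, padded at the sentence boundaries.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # A word bigram formed from the previous and the current token.
    feats["bigram_prev"] = feats["prev"] + " " + feats["lower"]
    return feats


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from entity-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, `token_features(["Dr.", "Smith", "reviewed"], 1)` describes the token "Smith" by its shape `Xxxxx`, prefix `Smi`, suffix `ith`, and its neighbors. A date such as "12/03/2019" gets the shape `dd/dd/dddd`, which suggests why shape features can help flag PHI regardless of the specific digits.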
