Article

Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness: A comparative of solutions in a COVID-19 mortality case study

COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE (2023)

Journal

COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE

Volume 242, Issue -, Pages -

Publisher

ELSEVIER IRELAND LTD

DOI: 10.1016/j.cmpb.2023.107803

Keywords

Machine learning; Missing data; Data imputation; Informative missingness; Electronic health records; COVID-19

Categories

Computer Science, Interdisciplinary Applications; Computer Science, Theory & Methods; Engineering, Biomedical; Medical Informatics

Summary

This study aims to characterize effective data imputation techniques and machine learning models for dealing with highly missing numerical data in electronic health records. The results suggest that combining translation and encoding imputation with tree ensemble classifiers can maximize performance in the presence of extremely incomplete data.

Abstract

Background and objective: Reusing Electronic Health Records (EHRs) for Machine Learning (ML) often yields extremely incomplete and sparse tabular datasets, which can hinder model development and limit model performance and generalization. In this study, we aimed to characterize the most effective data imputation techniques and ML models for dealing with highly missing numerical data in EHRs, in the case where only a very small fraction of the data is complete, as opposed to the usual case of a small number of missing values.

Methods: We used a case study including full blood count laboratory data, demographic data and survival data in the context of COVID-19 hospital admissions, and evaluated 30 processing pipelines combining imputation methods with ML classifiers. The imputation methods included missing mask, translation and encoding, mean imputation, k-nearest neighbors imputation, Bayesian ridge regression imputation and generative adversarial imputation networks. The classifiers included k-nearest neighbors, logistic regression, random forest, gradient boosting and deep multilayer perceptron.

Results: Our results suggest that, in the presence of highly missing data, combining translation and encoding imputation (which considers informative missingness) with tree ensemble classifiers (random forest and gradient boosting) is a sensible choice when aiming to maximize performance in terms of area under the curve.

Conclusions: Based on our findings, we recommend considering this imputer-classifier configuration when constructing models in the presence of extremely incomplete numerical data in EHRs.
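To make the setup in the abstract concrete, the following is a minimal sketch in plain Python. It enumerates the 30 imputer-classifier pipelines (6 imputers times 5 classifiers) and illustrates two of the simpler imputers. The abstract does not define "translation and encoding", so `translate_and_encode` is only an assumed reading: observed values are shifted to a strictly positive range and missing entries receive a code below that range, so missingness stays visible to the classifier. The function names, the `offset` and `missing_code` parameters, and the toy values are all hypothetical and are not taken from the paper.

```python
from itertools import product

# Imputer and classifier names as listed in the abstract.
IMPUTERS = [
    "missing mask",
    "translation and encoding",
    "mean",
    "k-nearest neighbors",
    "Bayesian ridge regression",
    "generative adversarial imputation network",
]
CLASSIFIERS = [
    "k-nearest neighbors",
    "logistic regression",
    "random forest",
    "gradient boosting",
    "deep multilayer perceptron",
]

# Every imputer paired with every classifier: 6 x 5 = 30 pipelines.
PIPELINES = list(product(IMPUTERS, CLASSIFIERS))


def missing_mask(rows):
    """Return a 0/1 matrix with 1 wherever a value is missing (None)."""
    return [[1 if v is None else 0 for v in row] for row in rows]


def translate_and_encode(rows, offset=1.0, missing_code=0.0):
    """ASSUMED reading of 'translation and encoding', not the paper's definition.

    Each column is shifted so that its smallest observed value equals
    `offset`; missing entries are then replaced by `missing_code`, which
    lies below every shifted observed value when missing_code < offset.
    """
    n_cols = len(rows[0])
    mins = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # A column with no observed values has nothing to shift.
        mins.append(min(observed) if observed else 0.0)
    return [
        [missing_code if v is None else v - mins[j] + offset
         for j, v in enumerate(row)]
        for row in rows
    ]


# Toy data: two made-up laboratory columns, None marks a missing value.
toy = [[4.5, None], [None, 150.0], [6.0, 210.0]]
mask = missing_mask(toy)
encoded = translate_and_encode(toy)
```

Under this reading, a tree-based classifier can separate missing from observed entries with a single threshold on the encoded column, which is one way an imputer could take informative missingness into account. In a real pipeline the column minima would be computed on the training split only and reused for validation and test data.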
