Article

Identifying who has long COVID in the USA: a machine learning approach using N3C data

LANCET DIGITAL HEALTH (2022)

Journal

LANCET DIGITAL HEALTH

Volume 4, Issue 7, Pages E532-E541

Publisher

ELSEVIER

DOI: 10.1016/S2589-7500(22)00048-6

Categories

Medical Informatics; Medicine, General & Internal

Funding

US National Institutes of Health
National Center for Advancing Translational Sciences through the RECOVER Initiative

Summary

Long COVID has had a severe impact on patient and societal recovery from the COVID-19 pandemic. This study developed machine learning models to accurately identify potential long COVID patients using electronic health records. Important features in identifying long COVID included healthcare utilization rate, patient age, dyspnea, and other diagnosis and medication information.

Abstract

Background: Post-acute sequelae of SARS-CoV-2 infection, known as long COVID, have severely affected recovery from the COVID-19 pandemic for patients and society alike. Long COVID is characterised by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous definition. Studies of electronic health records are a crucial element of the US National Institutes of Health's RECOVER Initiative, which is addressing the urgent need to understand long COVID, identify treatments, and accurately identify who has it; the latter is the aim of this study.

Methods: Using the National COVID Cohort Collaborative's (N3C) electronic health record repository, we developed XGBoost machine learning models to identify potential patients with long COVID. We defined our base population (n=1 793 604) as any non-deceased adult patient (age ≥18 years) with either an International Classification of Diseases-10-Clinical Modification COVID-19 diagnosis code (U07.1) from an inpatient or emergency visit, or a positive SARS-CoV-2 PCR or antigen test, and for whom at least 90 days had passed since the COVID-19 index date. We examined demographics, health-care utilisation, diagnoses, and medications for 97 995 adults with COVID-19. We used data on these features and 597 patients from a long COVID clinic to train three machine learning models to identify potential long COVID among all patients with COVID-19, patients hospitalised with COVID-19, and patients who had COVID-19 but were not hospitalised. Feature importance was determined via Shapley values. We further validated the models on data from a fourth site.

Findings: Our models identified, with high accuracy, patients who potentially have long COVID, achieving areas under the receiver operator characteristic curve of 0.92 (all patients), 0.90 (hospitalised), and 0.85 (non-hospitalised). Important features, as defined by Shapley values, include rate of health-care utilisation, patient age, dyspnoea, and other diagnosis and medication information available within the electronic health record.

Interpretation: Patients identified by our models as potentially having long COVID can be interpreted as patients warranting care at a specialty clinic for long COVID, which is an essential proxy for long COVID diagnosis as its definition continues to evolve. We also achieve the urgent goal of identifying potential long COVID in patients for clinical trials. As more data sources are identified, our models can be retrained and tuned based on the needs of individual studies.

Copyright © 2022 The Author(s). Published by Elsevier Ltd.
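
The base-population criteria stated in the Methods can be read as a simple eligibility rule. The following is a minimal sketch of that rule only, not the authors' code. The `Patient` record, its field names, and the `as_of` cut-off date are hypothetical and do not reflect the actual N3C data model; the sketch also assumes that age is evaluated once per patient, which the abstract does not specify.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


# Hypothetical record layout for illustration; real N3C tables are structured differently.
@dataclass
class Patient:
    age_years: int
    deceased: bool
    index_date: Optional[date]        # COVID-19 index date, if any
    u071_inpatient_or_ed: bool        # U07.1 code from an inpatient or emergency visit
    positive_pcr_or_antigen: bool     # positive SARS-CoV-2 PCR or antigen test


# "At least 90 days" since the COVID-19 index date, as stated in the abstract.
MIN_FOLLOW_UP = timedelta(days=90)


def in_base_population(p: Patient, as_of: date) -> bool:
    """Return True if the patient meets the base-population criteria from the abstract."""
    # Non-deceased adults only (age >= 18 years).
    if p.deceased or p.age_years < 18:
        return False
    # Either a qualifying U07.1 diagnosis or a positive PCR/antigen test.
    if not (p.u071_inpatient_or_ed or p.positive_pcr_or_antigen):
        return False
    # An index date is required, and at least 90 days must have elapsed since it.
    if p.index_date is None:
        return False
    return as_of - p.index_date >= MIN_FOLLOW_UP
```

For example, with a cut-off of `date(2021, 10, 1)`, a living 45-year-old with a qualifying U07.1 code and an index date of `date(2021, 3, 1)` would be included, while the same patient with an index date of `date(2021, 8, 1)` would be excluded because fewer than 90 days had elapsed.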
