4.6 Article

Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/jamia/ocaa258

Keywords

COVID-19; data quality; machine learning; biases; data sharing; distributed research networks; multi-site data; variability; heterogeneity; dataset shift

Funding

  1. Universitat Politecnica de Valencia [UPV-SUB.2-1302]
  2. FONDO SUPERA COVID-19 by CRUE-Santander Bank, grant "Severity Subgroup Discovery and Classification on COVID-19 Real World Data through Machine Learning and Data Quality assessment" (SUBCOVERWD-19)

The lack of representative COVID-19 data poses a challenge for reliable machine learning, and variability in data sources can lead to biases and increase the risk of overfitting in models. It is important to systematically assess and report data source variability and quality to ensure reliable and generalizable machine learning in COVID-19 research.
Objective: The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning.

Materials and Methods: We used the publicly available nCov2019 dataset, which includes patient-level data from several countries. We aimed to discover and classify severity subgroups using symptoms and comorbidities.

Results: Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect to the model's target populations and can increase model complexity, raising the risk of overfitting.

Conclusions: Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.
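As an illustration of what assessing data source variability can look like in practice, the sketch below compares how one categorical variable is distributed in two data sources using the Jensen-Shannon distance. This is an assumption made for illustration, not necessarily the method used in the paper; the function names (`distribution`, `js_distance`) and the toy records are hypothetical and are not taken from the nCov2019 dataset.

```python
import math
from collections import Counter


def distribution(values, categories):
    """Relative frequency of each category in a list of observations."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions.

    Returns 0 for identical distributions and 1 for distributions
    with no overlapping support.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, divergence))


# Toy, made-up symptom records for two hypothetical sources.
source_a = ["fever", "fever", "cough", "fever"]
source_b = ["cough", "pneumonia", "pneumonia", "cough"]

categories = sorted(set(source_a) | set(source_b))
d = js_distance(distribution(source_a, categories),
                distribution(source_b, categories))
print(f"Jensen-Shannon distance between sources: {d:.3f}")
```

A distance near 0 suggests the two sources record the variable similarly, while a value near 1 suggests they differ strongly, which is the kind of gap that could make one source unrepresentative of another's population.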
