4.7 Article

Robustifying genomic classifiers to batch effects via ensemble learning

Journal

BIOINFORMATICS
Volume 37, Issue 11, Pages 1521-1527

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btaa986

Funding

  1. Division of Mathematical Sciences, National Science Foundation (NSF-DMS) [1810829]
  2. National Cancer Institute, National Institutes of Health (NIH-NCI) [4P30CA006516-51, 5R01GM127430-02]
  3. Direct For Mathematical & Physical Scien
  4. Division Of Mathematical Sciences [1810829] Funding Source: National Science Foundation

This study compares two strategies for handling batch effects in genomic data and shows that an ensemble learning strategy performs more robustly, especially when batch effects are severe. The results provide practical guidelines for the development and evaluation of genomic classifiers.

Motivation: Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in the data caused by discrepancies across batches. Such 'batch effects' often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software that merges the data from different batches, then estimates the batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.

Results: We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluate the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low levels of heterogeneity, our ensemble learning strategy achieves more robust performance, especially when batch effects are severe. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.
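The ensemble strategy described in the abstract (one prediction model per batch, then a weighted combination) can be illustrated with a small sketch. This is not the authors' implementation: the nearest-centroid learner, the weighting rule (each model's mean accuracy on the batches it was not trained on), the simulated data, and all function names are assumptions chosen only to keep the example self-contained.

```python
import random
from statistics import mean

def fit_centroids(X, y):
    """Train a nearest-centroid model: one mean vector per class (0 and 1)."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [mean(col) for col in zip(*rows)]
    return cents

def predict_prob(model, x):
    """Soft score in [0, 1]: relative closeness to the class-1 centroid."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, model[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, model[1]))
    return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5

def accuracy(model, X, y):
    """Fraction of samples whose thresholded score matches the label."""
    return mean(int((predict_prob(model, x) > 0.5) == bool(t)) for x, t in zip(X, y))

def fit_ensemble(batches):
    """batches: list of (X, y). Returns per-batch models and normalized weights."""
    models = [fit_centroids(X, y) for X, y in batches]
    raw = []
    for i, m in enumerate(models):
        # Assumed weighting rule: mean accuracy on the batches not used for training.
        others = [accuracy(m, X, y) for j, (X, y) in enumerate(batches) if j != i]
        raw.append(mean(others))
    total = sum(raw)
    weights = [r / total for r in raw]
    return models, weights

def ensemble_prob(models, weights, x):
    """Weighted average of the per-batch model scores."""
    return sum(w * predict_prob(m, x) for m, w in zip(models, weights))

def simulate_batch(rng, n, shift):
    """Toy data: class signal on feature 0, additive batch shift on both features."""
    X, y = [], []
    for k in range(n):
        t = k % 2
        X.append([rng.gauss(2.0 * t, 0.5) + shift, rng.gauss(0.0, 0.5) + shift])
        y.append(t)
    return X, y

rng = random.Random(0)
batches = [simulate_batch(rng, 40, s) for s in (0.0, 0.3, -0.3)]
models, weights = fit_ensemble(batches)

# Independent "validation" batch, standing in for a separate study.
X_test, y_test = simulate_batch(rng, 40, 0.0)
preds = [int(ensemble_prob(models, weights, x) > 0.5) for x in X_test]
test_accuracy = mean(int(p == t) for p, t in zip(preds, y_test))
```

The merging strategy would instead pool all batches, estimate and remove the batch shifts, and fit a single model. The sketch skips that step entirely, which is the practical difference between the two approaches compared here.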
