Batch Effect Confounding Leads to Strong Bias in Performance Estimates Obtained by Cross-Validation

PLOS ONE (2014)

Journal

PLOS ONE

Volume 9, Issue 6, Pages -

Publisher

PUBLIC LIBRARY SCIENCE

DOI: 10.1371/journal.pone.0100335

Category

Multidisciplinary Sciences

Funding

Swiss National Science Foundation (SNF) [320030_135421]
Fondation Medic

Abstract

Background: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies.

Focus: The current study focuses on the construction of classifiers and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. This focus on estimation bias is a main difference from previous studies, which have mostly examined predictive performance and how it relates to the presence of batch effects.

Data: We work on simulated data sets. To obtain realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects.

Methods: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, are performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.
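The comparison described in the abstract, a cross-validated performance estimate versus performance on independent data, can be illustrated with a toy simulation. The sketch below is not the authors' code or study design. It assumes Gaussian noise in place of real gene expression data, no true group difference at all, a nearest-centroid classifier in place of the classifiers listed above, and plain k-fold rather than nested cross-validation, with no feature selection or tuning. All names and parameters (simulate, compare, confounding, batch_shift) are hypothetical.

```python
import random


def simulate(n, n_features, confounding, batch_shift, rng):
    """Pure-noise features plus an additive shift for samples in batch 1.

    `confounding` is the probability that a sample's batch equals its group:
    1.0 means group and batch coincide, 0.5 means they are unrelated.
    There is no true group signal, so honest accuracy is about 0.5.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels to get two equal-sized groups
        batch = label if rng.random() < confounding else 1 - label
        X.append([rng.gauss(0.0, 1.0) + batch_shift * batch
                  for _ in range(n_features)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid 'training': one mean vector per group."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the group whose centroid is closest (squared distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def cv_accuracy(X, y, k, rng):
    """Plain k-fold cross-validated accuracy on a single data set."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    correct = 0
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)


def compare(confounding, seed=0, n_train=200, n_test=400,
            n_features=20, batch_shift=1.0, k=5):
    """Return (cross-validated accuracy, accuracy on independent data)."""
    rng = random.Random(seed)
    X, y = simulate(n_train, n_features, confounding, batch_shift, rng)
    cv = cv_accuracy(X, y, k, rng)
    # Independent data: same batch shift, but batch is unrelated to group.
    X_new, y_new = simulate(n_test, n_features, 0.5, batch_shift, rng)
    model = fit_centroids(X, y)
    hits = sum(predict(model, x) == t for x, t in zip(X_new, y_new))
    return cv, hits / n_test


if __name__ == "__main__":
    for level in (0.5, 0.75, 1.0):
        cv, independent = compare(confounding=level, seed=1)
        print(f"confounding={level:.2f}  CV={cv:.2f}  independent={independent:.2f}")
```

Under these assumptions, the classifier can only learn the batch shift. When group and batch coincide, cross-validation rewards it for doing so, while the independent data, where batch carries no information about group, do not; the gap between the two numbers is the estimation bias the abstract refers to.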
