Article

Evaluation of Classifier Performance for Multiclass Phenotype Discrimination in Untargeted Metabolomics

METABOLITES (2017)

Journal

METABOLITES

Volume 7, Issue 2, Pages -

Publisher

MDPI

DOI: 10.3390/metabo7020030

Keywords

metabolomic phenotyping; statistical classification; machine learning; discrimination; partial least squares-discriminant analysis; Random Forests; support vector machines; artificial Neural Networks; Naive Bayes; k-Nearest Neighbors

Category

Biochemistry & Molecular Biology

Funding

American Heart Association [11CRP7300003]
National Institute of General Medical Sciences [GM103492]
Wendell Cherry Chair in Clinical Trial Research
James Graham Brown Cancer Center

Abstract

Statistical classification is a critical component of utilizing metabolomics data for examining the molecular determinants of phenotypes. Despite this, a comprehensive and rigorous evaluation of the accuracy of classification techniques for phenotype discrimination given metabolomics data has not been conducted. We conducted such an evaluation using both simulated and real metabolomics datasets, comparing Partial Least Squares-Discriminant Analysis (PLS-DA), Sparse PLS-DA (sPLS-DA), Random Forests, Support Vector Machines (SVM), Artificial Neural Networks, k-Nearest Neighbors (k-NN), and Naive Bayes classification techniques for discrimination. We evaluated the techniques on simulated data generated to mimic global untargeted metabolomics data by incorporating realistic block-wise correlation and partial correlation structures that reproduce the correlations and metabolite clustering generated by biological processes. Over the simulation studies, covariance structures, means, and effect sizes were stochastically varied to provide consistent estimates of classifier performance over a wide range of possible scenarios. The effects of non-normal error distributions, biological and technical outliers, unbalanced phenotype allocation, missing values due to abundances below a limit of detection, and prior-significance filtering (dimension reduction) were evaluated via simulation. In each simulation, classifier parameters, such as the number of hidden nodes in a Neural Network, were optimized by cross-validation to minimize the probability of detecting spurious results due to poorly tuned classifiers. Classifier performance was then evaluated using real metabolomics datasets of varying sample medium, sample size, and experimental design.
We report that in the most realistic simulation studies that incorporated non-normal error distributions, unbalanced phenotype allocation, outliers, missing values, and dimension reduction, classifier performance (least to greatest error) was ranked as follows: SVM, Random Forest, Naive Bayes, sPLS-DA, Neural Networks, PLS-DA, and k-NN classifiers. When non-normal error distributions were introduced, the performance of PLS-DA and k-NN classifiers deteriorated further relative to the remaining techniques. Over the real datasets, a trend toward better performance of the SVM and Random Forest classifiers was observed.
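
The simulation-and-tuning workflow described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes a single shared latent factor per block to produce block-wise correlation, substitutes the limit of detection (LOD) for values below it, and tunes only the k of a k-NN classifier by unstratified cross-validation. All function names, parameter values, and the choice of k-NN are hypothetical and chosen for brevity.

```python
import math
import random


def simulate_block_data(n_per_class, n_blocks, block_size, rho, shift, rng):
    """Simulate features with block-wise correlation (hypothetical helper).

    Every feature is sqrt(rho) * latent + sqrt(1 - rho) * noise, where the
    latent factor is shared within a block. Two features in the same block
    therefore have correlation rho; features in different blocks are
    independent. Class c raises the mean of block c by `shift`.
    """
    X, y = [], []
    for label, n in enumerate(n_per_class):
        for _ in range(n):
            row = []
            for block in range(n_blocks):
                latent = rng.gauss(0.0, 1.0)
                mean = shift if block == label else 0.0
                for _ in range(block_size):
                    noise = rng.gauss(0.0, 1.0)
                    row.append(mean + math.sqrt(rho) * latent
                               + math.sqrt(1.0 - rho) * noise)
            X.append(row)
            y.append(label)
    return X, y


def censor_at_lod(X, lod):
    """Treat values below the detection limit as missing and fill with the LOD."""
    return [[max(v, lod) for v in row] for row in X]


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training samples (Euclidean)."""
    nearest = sorted((math.dist(t, x), lab) for t, lab in zip(train_X, train_y))[:k]
    votes = {}
    for _, lab in nearest:
        votes[lab] = votes.get(lab, 0) + 1
    # Ties go to the smallest label so the result is deterministic.
    return max(sorted(votes), key=lambda lab: votes[lab])


def cv_error(X, y, k, n_folds, rng):
    """Misclassification rate of k-NN estimated by n_folds-fold cross-validation."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    errors = 0
    for fold in range(n_folds):
        test = set(idx[fold::n_folds])
        train_X = [X[i] for i in idx if i not in test]
        train_y = [y[i] for i in idx if i not in test]
        errors += sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in test)
    return errors / len(X)


def tune_k(X, y, candidates, n_folds, seed):
    """Pick the k with the lowest cross-validated error, reusing the same folds."""
    scores = {k: cv_error(X, y, k, n_folds, random.Random(seed)) for k in candidates}
    return min(candidates, key=lambda k: scores[k]), scores


rng = random.Random(0)
# Unbalanced allocation: 30, 20, and 10 samples in three phenotype classes.
X, y = simulate_block_data([30, 20, 10], n_blocks=4, block_size=5,
                           rho=0.6, shift=1.5, rng=rng)
X = censor_at_lod(X, lod=-1.0)
best_k, scores = tune_k(X, y, candidates=[1, 3, 5, 7], n_folds=5, seed=1)
print("best k:", best_k, "CV error by k:", scores)
```

Seeding every candidate with the same fold assignment means differences in error reflect the tuning parameter and not the random split. A fuller comparison along the lines of the study would repeat this over many simulated datasets and over each classifier family.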
