4.4 Article

The influence of scaling metabolomics data on model classification accuracy

Journal

METABOLOMICS
Volume 11, Issue 3, Pages 684-695

Publisher

SPRINGER
DOI: 10.1007/s11306-014-0738-7

Keywords

Pre-treatment; Metabolomics; Classification accuracy; Scaling; Chemometrics

Funding

  1. PhastID European project within Seventh Framework Programme for Research and Technological Development [258238]
  2. Engineering and Physical Sciences Research Council [GR/S96685/01] Funding Source: researchfish

Ask authors/readers for more resources

Correctly measured classification accuracy is an important aspect not only to classify pre-designated classes such as disease versus control properly, but also to ensure that the biological question can be answered competently. We recognised that there has been minimal investigation of pre-treatment methods and its influence on classification accuracy within the metabolomics literature. The standard approach to pre-treatment prior to classification modelling often incorporates the use of methods such as autoscaling, which positions all variables on a comparable scale thus allowing one to achieve separation of two or more groups (target classes). This is often undertaken without any prior investigation into the influence of the pre-treatment method on the data and supervised learning techniques employed. Whilst this is useful for deriving essential information such as predictive ability or visual interpretation in many cases, as shown in this study the standard approach is not always the most suitable option available. Here, a study has been conducted to investigate the influence of six pre-treatment methods-autoscaling, range, level, Pareto and vast scaling, as well as no scaling-on four classification models, including: principal components-discriminant function analysis (PC-DFA), support vector machines (SVM), random forests (RF) and k-nearest neighbours (kNN)-using three publically available metabolomics data sets. We have demonstrated that undertaking different pre-treatment methods can greatly affect the interpretation of the statistical modelling outputs. The results have shown that data pre-treatment is context dependent and that there was no single superior method for all the data sets used. Whilst we did find that vast scaling produced the most robust models in terms of classification rate for PC-DFA of both NMR spectroscopy data sets, in general we conclude that both vast scaling and autoscaling produced similar and superior results in comparison to the other four pre-treatment methods on both NMR and GC-MS data sets. It is therefore our recommendation that vast scaling is the primary pre-treatment method to use as this method appears to be more stable and robust across all the different classifiers that were conducted in this study.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.4
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available