Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

METABOLOMICS (2018)

Journal

METABOLOMICS

Volume 14, Issue 10

Publisher

SPRINGER

DOI: 10.1007/s11306-018-1420-2

Keywords

Untargeted metabolomics; Missing values imputation; Limit of detection; Batch effects; MICE; K-nearest neighbor; Mass spectrometry

Funding

German Federal Ministry of Education and Research (BMBF) [01ZX1313C, 03IS2061B]
European Union's Seventh Framework Programme [FP7-Health-F5-2012] [305280]
European Research Council (starting grant LatentCauses)
Biomedical Research Program funds at Weill Cornell Medical College in Qatar
Qatar Foundation
Helmholtz Zentrum Munchen, German Research Center for Environmental Health, Neuherberg, Germany
Medical Research Council [MC_PC_13048, MC_UU_12015/1]
MRC [MC_UU_12015/1] Funding Source: UKRI

Abstract

Background: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation.

Methods: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n=1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci.

Results: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.

Conclusion: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.
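The recommended approach, KNN imputation on observations with variable pre-selection, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the algorithmic details, so the function names `pairwise_corr` and `knn_impute_obs`, the correlation threshold `min_abs_corr=0.2`, the default `k=10`, the root-mean-square distance over shared variables, and the column-mean fallback are all assumptions made for illustration. The sketch also assumes the input matrix (samples in rows, metabolites in columns) is already transformed and scaled so that distances across variables are comparable.

```python
import numpy as np


def pairwise_corr(x, y):
    """Pearson correlation over samples where both variables are observed."""
    ok = ~np.isnan(x) & ~np.isnan(y)
    if ok.sum() < 3:
        return 0.0
    xs, ys = x[ok], y[ok]
    if xs.std() == 0 or ys.std() == 0:
        return 0.0
    return float(np.corrcoef(xs, ys)[0, 1])


def knn_impute_obs(X, k=10, min_abs_corr=0.2):
    """Impute NaNs column by column using the k nearest samples (observations).

    Hypothetical sketch: for each metabolite with missing values, distances
    between samples use only variables whose absolute correlation with that
    metabolite exceeds `min_abs_corr` (the assumed pre-selection rule).
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n, p = X.shape
    for j in range(p):
        miss = np.isnan(X[:, j])
        if not miss.any() or miss.all():
            continue  # nothing to impute, or no donors available
        # Variable pre-selection: keep variables correlated with metabolite j.
        sel = [v for v in range(p)
               if v != j and abs(pairwise_corr(X[:, j], X[:, v])) > min_abs_corr]
        col_mean = np.nanmean(X[:, j])
        if not sel:
            out[miss, j] = col_mean  # assumed fallback: column mean
            continue
        donors = np.where(~miss)[0]  # samples with metabolite j observed
        Z = X[:, sel]
        for i in np.where(miss)[0]:
            diff = Z[donors] - Z[i]  # NaN where either sample lacks a value
            shared = (~np.isnan(diff)).sum(axis=1)
            with np.errstate(invalid="ignore", divide="ignore"):
                dist = np.sqrt(np.nansum(diff ** 2, axis=1) / shared)
            dist[shared == 0] = np.inf  # no shared variables: unusable donor
            order = np.argsort(dist)[:k]
            order = order[np.isfinite(dist[order])]
            if order.size == 0:
                out[i, j] = col_mean
            else:
                # Impute with the mean of the k nearest donors' values.
                out[i, j] = X[donors[order], j].mean()
    return out


# Toy usage: the second column is twice the first; the last sample is missing it.
toy = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [2.1, np.nan]])
imputed = knn_impute_obs(toy, k=1)
print(imputed[4, 1])  # 4.0, the value of the single nearest sample
```

Restricting the distance to pre-selected variables keeps unrelated metabolites from dominating the neighbor search, which is one plausible reason such a variant could be more robust than KNN over all variables. The loop over variables and samples is written for clarity, not speed.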
