Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation

JOURNAL OF CHEMINFORMATICS (2014)

Journal

JOURNAL OF CHEMINFORMATICS

Volume 6

Publisher

BMC

DOI: 10.1186/s13321-014-0047-1

Keywords

Cross-validation; Double cross-validation; Internal validation; External validation; Prediction error; Regression

Abstract

Background: QSAR modelling generally requires both model selection and validation, since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging, especially under model uncertainty, and requires independent test objects. These test objects must be involved neither in model building nor in model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive way to generate test data and to select QSAR models, since it uses the data very efficiently. Nevertheless, there is a controversy in the literature about the reliability of double cross-validation under model uncertainty. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection.

Methods: Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided.

Results: The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters for the inner loop of double cross-validation mainly influence the bias and variance of the resulting models, the parameters for the outer loop mainly influence the variability of the resulting prediction error estimate.

Conclusions: Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. Compared to a single test set, double cross-validation provided a more realistic picture of model quality and should be preferred over a single test set.
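The two loops described in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol. It assumes a deliberately simple setting: the "model" is a one-descriptor least-squares line, and "variable selection" means choosing which single descriptor to use. All function names (`fit_line`, `select_descriptor`, `double_cv`), the fold counts and the synthetic data are hypothetical choices made for illustration. The point it shows is that the outer-loop test objects take part neither in fitting nor in descriptor selection.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on a single descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return my - b * mx, b

def mse(X, y, rows, col, model):
    """Mean squared prediction error of `model` on the given rows."""
    a, b = model
    return sum((y[i] - (a + b * X[i][col])) ** 2 for i in rows) / len(rows)

def k_folds(rows, k, rng):
    """Shuffle the row indices and split them into k disjoint folds."""
    rows = list(rows)
    rng.shuffle(rows)
    return [rows[i::k] for i in range(k)]

def select_descriptor(X, y, rows, k_inner, rng):
    """Inner loop: pick the descriptor with the lowest cross-validated error."""
    folds = k_folds(rows, k_inner, rng)
    best_col, best_err = None, float("inf")
    for col in range(len(X[0])):
        err = 0.0
        for val in folds:
            held_out = set(val)
            train = [i for i in rows if i not in held_out]
            model = fit_line([X[i][col] for i in train], [y[i] for i in train])
            err += mse(X, y, val, col, model) * len(val)
        err /= len(rows)
        if err < best_err:
            best_col, best_err = col, err
    return best_col, best_err

def double_cv(X, y, k_outer=5, k_inner=4, seed=0):
    """Outer loop: estimate the prediction error on objects never used for selection."""
    rng = random.Random(seed)
    sq_err, chosen = 0.0, []
    for test in k_folds(range(len(y)), k_outer, rng):
        held_out = set(test)
        rest = [i for i in range(len(y)) if i not in held_out]
        # Model selection sees only the non-test objects.
        col, _ = select_descriptor(X, y, rest, k_inner, rng)
        # Refit the selected model on all non-test objects, then assess it.
        model = fit_line([X[i][col] for i in rest], [y[i] for i in rest])
        sq_err += mse(X, y, test, col, model) * len(test)
        chosen.append(col)
    return sq_err / len(y), chosen

# Synthetic data (assumed): 60 objects, 5 descriptors, only descriptor 2 is informative.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [2.0 * row[2] + data_rng.gauss(0, 0.5) for row in X]

pe, chosen = double_cv(X, y)
print("outer-loop prediction error (MSE):", pe)
print("descriptor selected in each outer split:", chosen)
```

In this sketch, `k_inner` governs how the models are selected, while `k_outer` sets the test set size for each outer split. This mirrors the abstract's distinction between inner-loop and outer-loop parameters, although the toy example is far too small to reproduce the study's findings.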
