The double cross-validation software tool for MLR QSAR model development

Journal

CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS
Volume 159, Pages 108-126

Publisher

ELSEVIER
DOI: 10.1016/j.chemolab.2016.10.009

Keywords

QSAR; QSPR; Double cross-validation; Prediction errors; Validation set

Funding

  1. University with Potential for Excellence Phase II (UPE-II) scheme from the University Grants Commission, New Delhi

Quantitative structure-activity relationship (QSAR) modeling is a widely used computational technique applied in fields including rational drug design, toxicity and property prediction of chemicals and pharmaceuticals, and environmental risk assessment and fate modeling. External validation is generally considered the gold standard for evaluating the predictive performance of a model, at least by one group of QSAR practitioners. It is commonly performed by the hold-out method, in which the original dataset is divided into training and test sets; the training set is used for model building and model selection, while the test set is used solely for model assessment. However, because the composition of the training set stays fixed in this method, there is no certainty that the resulting model is optimal, as descriptor selection may be biased. This problem is more evident for multiple linear regression (MLR) models than for the more robust and generalized partial least squares (PLS) and principal component regression (PCR) models. Double cross-validation could therefore be a better choice: the training set is further divided into 'n' calibration and validation sets, resulting in diverse compositions. In the present work, we have developed an open-access Double Cross-Validation (DCV) software tool that performs MLR model development using the double cross-validation technique. Two variable selection methods, stepwise MLR (S-MLR) and genetic algorithm MLR (GA-MLR), are incorporated in the tool, and the tool can optionally perform data pretreatment before double cross-validation is applied. We have also used the tool on three different datasets to determine which technique, hold-out or double cross-validation, selects the better model in terms of predictive performance on the test set. The performance of the tool in generating predictive PLS models is also compared.
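The splitting scheme described in the abstract can be illustrated with a minimal sketch. This is not the DCV tool's implementation: the data are synthetic, the function names (`fit_mlr`, `forward_select`, `double_cv`) are hypothetical, a simple greedy forward selection stands in for S-MLR and GA-MLR, and picking the descriptor subset with the lowest validation RMSE is an assumed selection rule.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def forward_select(X, y, max_terms):
    """Greedy forward selection (a simplified stand-in for S-MLR):
    repeatedly add the descriptor that most lowers the fitting RMSE."""
    chosen = []
    for _ in range(max_terms):
        best_j, best_err = None, np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = chosen + [j]
            err = rmse(y, predict_mlr(fit_mlr(X[:, cols], y), X[:, cols]))
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen

def double_cv(X_train, y_train, n_splits=5, max_terms=2, seed=0):
    """Inner loop: split the training set into n calibration/validation
    pairs, select descriptors on each calibration set, and score the
    resulting model on the matching validation set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_train)), n_splits)
    candidates = []
    for k in range(n_splits):
        val = folds[k]
        cal = np.concatenate([folds[i] for i in range(n_splits) if i != k])
        cols = forward_select(X_train[cal], y_train[cal], max_terms)
        coef = fit_mlr(X_train[cal][:, cols], y_train[cal])
        err = rmse(y_train[val], predict_mlr(coef, X_train[val][:, cols]))
        candidates.append((err, cols))
    # Assumed rule: keep the descriptor subset with the lowest validation RMSE.
    best_err, best_cols = min(candidates, key=lambda c: c[0])
    return sorted(best_cols), best_err

# Synthetic data: only descriptors 0 and 3 actually drive the response.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=60)

# Outer hold-out split: the test set is never used for model selection.
X_train, y_train, X_test, y_test = X[:45], y[:45], X[45:], y[45:]

selected, val_rmse = double_cv(X_train, y_train)
final = fit_mlr(X_train[:, selected], y_train)  # refit on the full training set
test_rmse = rmse(y_test, predict_mlr(final, X_test[:, selected]))
print("selected descriptors:", selected)
print("validation RMSE:", round(val_rmse, 3), "test RMSE:", round(test_rmse, 3))
```

In a plain hold-out workflow, `forward_select` would run once on the whole training set; here it runs on several differently composed calibration sets, which is what exposes any dependence of descriptor selection on training-set composition.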
