Exploring the impact of size of training sets for the development of predictive QSAR models

CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS (2008)

Journal

CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS

Volume 90, Issue 1, Pages 31-42

Publisher

ELSEVIER SCIENCE BV

DOI: 10.1016/j.chemolab.2007.07.004

Keywords

QSAR; validation; training set size; K-means clusters; stepwise regression; FA-MLR; PLS

Abstract

When building a predictive quantitative structure-activity relationship (QSAR) model, validating the developed model is a very important task. However, because a truly new set of data is often unavailable for checking the predictability and robustness of the developed model, external validation in QSAR studies is commonly performed by splitting the available data into training and test sets. In the present work we have attempted to explore the impact of training set size on the quality of prediction using different topological descriptors and three different statistical techniques. Three data sets of moderate size were used for the present study: cytoprotection data of anti-HIV thiocarbamates (n=62), HIV reverse transcriptase inhibition data of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives (n=107) and bioconcentration factor data of diverse functional compounds (n=122). In each case, the data set was divided into different combinations of training and test sets, maintaining different size ratios over several iterations. For the first two data sets, reducing the training set size had a significant impact on the predictive ability of the models, with the first data set showing a higher dependence on size than the second. However, for the modeling of bioconcentration factor, no significant impact of training set size on the quality of prediction could be found. Hence, no general rule can be formulated regarding the impact of training set size on the quality of prediction. The optimum size of the training set should be chosen based on the particular data set and on the types of descriptors and statistical analysis being used. © 2007 Elsevier B.V. All rights reserved.
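The kind of experiment the abstract describes, repeatedly splitting one data set at several training/test ratios and comparing external predictivity, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' procedure: the data are synthetic (62 points with a single made-up descriptor), the splits are purely random rather than cluster-based, the model is a one-descriptor least-squares line rather than stepwise regression, FA-MLR or PLS, and the score is a predictive R² referenced to the training-set mean, a definition often used in QSAR work but not stated in the abstract. The names `fit_line`, `r2_pred` and `size_study` are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single descriptor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r2_pred(train, test):
    """External predictive R^2 (assumed definition):
    1 - sum((y_obs - y_pred)^2) / sum((y_obs - mean(y_train))^2) over the test set.
    """
    xs, ys = zip(*train)
    a, b = fit_line(xs, ys)
    y_bar_train = mean(ys)
    press = sum((y - (a + b * x)) ** 2 for x, y in test)
    ss = sum((y - y_bar_train) ** 2 for _, y in test)
    return 1.0 - press / ss


def size_study(data, train_fractions, n_iter, seed=0):
    """Mean predictive R^2 over n_iter random splits for each training fraction."""
    rng = random.Random(seed)
    results = {}
    for frac in train_fractions:
        n_train = round(frac * len(data))
        scores = []
        for _ in range(n_iter):
            shuffled = data[:]
            rng.shuffle(shuffled)  # random split; the paper's keywords suggest K-means clusters instead
            scores.append(r2_pred(shuffled[:n_train], shuffled[n_train:]))
        results[frac] = mean(scores)
    return results


# Synthetic (descriptor, activity) pairs; NOT data from the paper.
_rng = random.Random(42)
data = [(x, 2.0 * x + 1.0 + _rng.gauss(0, 0.5))
        for x in (_rng.uniform(0, 5) for _ in range(62))]

summary = size_study(data, [0.25, 0.5, 0.75], n_iter=20)
for frac, score in summary.items():
    print(f"train fraction {frac:.2f}: mean R2_pred = {score:.3f}")
```

On clean synthetic data like this the score barely changes with the training fraction, which is in line with the abstract's point that the sensitivity to training set size depends on the data set and the modeling method rather than following a general rule.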
