Article

Investigating the influence of data splitting on the predictive ability of QSAR/QSPR models

Journal

STRUCTURAL CHEMISTRY
Volume 22, Issue 4, Pages 795-804

Publisher

SPRINGER/PLENUM PUBLISHERS
DOI: 10.1007/s11224-011-9757-4

Keywords

Data splitting; External validation; QSAR; QSPR; Predictivity; Kennard-Stone; Duplex; Model reproducibility

Funding

  1. Foundation for Polish Science
  2. Norwegian Financial Mechanism
  3. EEA Financial Mechanism in Poland
  4. Polish Ministry of Science and Higher Education [DS/8430-4-0171-11]

This study investigated how the method used to split data into a training set and a test set influences the external predictivity of quantitative structure-activity and structure-property relationship (QSAR/QSPR) models. Six good-quality models were collected from the literature, then redeveloped and validated using five alternative splitting algorithms: (i) a commonly used algorithm ('Z:1'), in which the compounds are sorted in ascending order of the response values (y) and every zth (e.g. third) compound is assigned to the test set; (ii-iv) three variations of the Kennard-Stone algorithm; and (v) the duplex algorithm. The external validation statistics reported for each model served as the basis for the final comparison. We demonstrated that splitting techniques that use the values of the molecular descriptors (X), either alone or in combination with the model response (y), always led to models with better external predictivity than models built with splits based on the y values only. Moreover, we showed that the external validation coefficient ($Q^2_{\mathrm{EXT}}$) is more sensitive to the splitting technique than the root-mean-square error of prediction (RMSEP). This difference becomes especially important when the test set is relatively small (between 5 and 10 compounds). For models trained and validated with a small number of compounds, it is strongly recommended that both statistics ($Q^2_{\mathrm{EXT}}$ and RMSEP) be taken into account when evaluating external predictivity.
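
To make the splitting schemes and the two statistics concrete, here is a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not settle: the 'Z:1' split is taken to start counting at the zth compound of the sorted list, only the basic Euclidean-distance form of Kennard-Stone is shown (not the three variations studied, nor duplex), and the external validation coefficient is taken to be the variant that compares test-set residuals with deviations from the training-set mean. All function names are hypothetical.

```python
import numpy as np


def z_to_one_split(y, z=3):
    """'Z:1' split: sort compounds by y (ascending), send every z-th to the test set.

    Assumption: counting starts at the z-th compound of the sorted list.
    Returns (train_idx, test_idx) as arrays of original row indices.
    """
    order = np.argsort(np.asarray(y), kind="stable")
    test_idx = order[z - 1::z]
    train_idx = np.setdiff1d(order, test_idx)  # sorted original indices
    return train_idx, test_idx


def kennard_stone_split(X, n_train):
    """Basic Kennard-Stone selection on descriptor values X (Euclidean distance).

    Starts from the two most distant compounds, then repeatedly adds the
    compound whose nearest already-selected neighbour is farthest away.
    Requires 2 <= n_train <= number of compounds.
    """
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # distance from each remaining compound to its closest selected one
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)


def rmsep(y_test, y_pred):
    """Root-mean-square error of prediction over the test set."""
    y_test, y_pred = np.asarray(y_test, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_test - y_pred) ** 2)))


def q2_ext(y_test, y_pred, y_train_mean):
    """External validation coefficient (assumed variant, see lead-in)."""
    y_test, y_pred = np.asarray(y_test, float), np.asarray(y_pred, float)
    press = np.sum((y_test - y_pred) ** 2)
    ss = np.sum((y_test - y_train_mean) ** 2)
    return float(1.0 - press / ss)
```

Under this assumed definition, the denominator of the external validation coefficient depends on how widely the test-set responses are spread around the training-set mean, whereas RMSEP depends only on the residuals. That gives one plausible reading of why the coefficient reacts more strongly to the choice of split when the test set holds only a few compounds.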
