Article

Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration

Journal

MOLECULAR SYSTEMS DESIGN & ENGINEERING
Volume 4, Issue 5, Pages 1048-1057

Publisher

Royal Society of Chemistry
DOI: 10.1039/c9me00078j

Funding

  1. Lehigh University

In this paper, we consider the problem of designing a compact training set, comprising the most informative molecules from a specified library, for building data-driven molecular property models. Using (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method that combines rigorous model-based design of experiments with cheminformatics-based diversity-maximizing subset selection within the epsilon-greedy framework, in order to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on various databases, including QM7, NIST, and a dataset of surface intermediates, for calculating thermodynamic properties (heat of atomization and enthalpy of formation). For sparse group additive models, balancing exploration (diversity-maximizing selection) against exploitation (D-optimality selection) allows training on a fraction of the data (sometimes as little as 15%) while achieving accuracy similar to five-fold cross-validation on the entire set. In contrast, our results indicate that kernel methods favor diversity-maximizing selection.
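The abstract does not spell out the selection loop, so the following is only a minimal sketch of how an epsilon-greedy mix of D-optimality and diversity-maximizing selection could look, not the authors' implementation. It assumes each molecule is described by a numeric feature vector (for example, group counts), uses Euclidean MaxMin distance as a stand-in for the cheminformatics-based diversity measure, and adds a small ridge term so the information matrix stays invertible. The function name `epsilon_greedy_select` and all of its parameters are hypothetical.

```python
import numpy as np


def epsilon_greedy_select(X, n_select, epsilon=0.3, ridge=1e-6, seed=0):
    """Pick n_select rows of X by mixing exploration and exploitation.

    Exploration (probability epsilon): MaxMin diversity, i.e. the candidate
    farthest (Euclidean, an assumption) from the already selected set.
    Exploitation (otherwise): greedy D-optimality, i.e. the candidate that
    most increases det(X_S^T X_S + ridge * I).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if not 0 < n_select <= n:
        raise ValueError("n_select must be between 1 and the number of rows")

    # Seed the design with one randomly chosen molecule.
    first = int(rng.integers(n))
    selected = [first]
    available = np.ones(n, dtype=bool)
    available[first] = False

    # Regularized information matrix of the current design.
    M = ridge * np.eye(p) + np.outer(X[first], X[first])
    # Distance from every candidate to its nearest selected molecule.
    min_dist = np.linalg.norm(X - X[first], axis=1)

    while len(selected) < n_select:
        if rng.random() < epsilon:
            # Explore: favor the candidate farthest from the selected set.
            score = min_dist.copy()
        else:
            # Exploit: by the matrix determinant lemma, adding row x scales
            # det(M) by (1 + x^T M^{-1} x), so maximize that quadratic form.
            M_inv_Xt = np.linalg.solve(M, X.T)
            score = np.einsum("ij,ji->i", X, M_inv_Xt)
        score[~available] = -np.inf

        pick = int(np.argmax(score))
        selected.append(pick)
        available[pick] = False
        M += np.outer(X[pick], X[pick])
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[pick], axis=1))

    return selected
```

In this sketch, `epsilon=1.0` reduces to pure diversity-maximizing selection and `epsilon=0.0` to pure greedy D-optimal selection; intermediate values trade the two off in the spirit described above.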
