Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration

MOLECULAR SYSTEMS DESIGN & ENGINEERING (2019)

Journal

MOLECULAR SYSTEMS DESIGN & ENGINEERING

Volume 4, Issue 5, Pages 1048-1057

Publisher

ROYAL SOC CHEMISTRY

DOI: 10.1039/c9me00078j

Funding

Lehigh University

Abstract

In this paper, we consider the problem of designing a compact training set comprising the most informative molecules from a specified library to build data-driven molecular property models. Specifically, using (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method combining rigorous model-based design of experiments and cheminformatics-based diversity-maximizing subset selection within the epsilon-greedy framework to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on various databases, including QM7, NIST, and a dataset of surface intermediates, for calculating thermodynamic properties (heat of atomization and enthalpy of formation). For sparse group additive models, a balance between exploration (diversity-maximizing selection) and exploitation (D-optimality selection) allows models trained on a fraction of the data (sometimes as little as 15%) to achieve accuracy similar to that of five-fold cross-validation on the entire set. On the other hand, our results indicate that kernel methods prefer diversity-maximizing selection.
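The selection idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the molecular descriptors, the distance measure, or the exact D-optimality procedure. The sketch assumes a generic numeric feature matrix `X`, Euclidean distance for the diversity step (a greedy MaxMin pick), and a greedy D-optimal step that picks the candidate with the largest value of x^T M^{-1} x, which by the matrix determinant lemma gives the largest increase in det(M) for the information matrix M. The function name `epsilon_greedy_select` and the parameters `epsilon`, `ridge`, and `seed` are hypothetical.

```python
import numpy as np

def epsilon_greedy_select(X, k, epsilon=0.5, ridge=1e-6, seed=0):
    """Pick k rows of X by mixing exploration and exploitation.

    With probability `epsilon` the next pick is the candidate farthest from
    the already selected set (MaxMin diversity, Euclidean distance assumed).
    Otherwise it is the candidate that most increases det(X_S^T X_S + ridge*I)
    (greedy D-optimality). All names here are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of rows in X")
    rng = np.random.default_rng(seed)

    # Start from a random molecule (row).
    first = int(rng.integers(n))
    selected = [first]
    available = np.ones(n, dtype=bool)
    available[first] = False

    # Regularized information matrix of the selected set.
    M = ridge * np.eye(p) + np.outer(X[first], X[first])
    # Distance from every candidate to its nearest selected row.
    min_dist = np.linalg.norm(X - X[first], axis=1)

    while len(selected) < k:
        if rng.random() < epsilon:
            # Exploration: maximize the minimum distance to the selected set.
            scores = min_dist.copy()
        else:
            # Exploitation: det(M + x x^T) = det(M) * (1 + x^T M^{-1} x),
            # so the largest x^T M^{-1} x gives the largest determinant gain.
            M_inv = np.linalg.inv(M)
            scores = np.einsum("ij,jk,ik->i", X, M_inv, X)
        scores[~available] = -np.inf  # never pick the same row twice
        j = int(np.argmax(scores))

        selected.append(j)
        available[j] = False
        M += np.outer(X[j], X[j])
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[j], axis=1))
    return selected
```

Setting `epsilon=1.0` reduces the sketch to pure diversity-maximizing selection, and `epsilon=0.0` to pure greedy D-optimal selection; intermediate values interpolate between the two regimes the abstract contrasts.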
