Article

Extracting Predictive Representations from Hundreds of Millions of Molecules

Journal

The Journal of Physical Chemistry Letters
Volume 12, Issue 44, Pages 10793-10801

Publisher

American Chemical Society
DOI: 10.1021/acs.jpclett.1c03058

Funding

  1. Shenzhen Science and Technology Research Grant [JCYJ20200109140416788]
  2. Chemistry and Chemical Engineering Guangdong Laboratory [1922018]
  3. Soft Science Research Project of Guangdong Province [2017B030301013]
  4. NSF [DMS-2052983, DMS-1761320, IIS-1900473]
  5. Bristol-Myers Squibb
  6. Pfizer
  7. Michigan State University
  8. NIH [GM126189]

Abstract

The construction of appropriate representations remains essential for molecular predictions because of intricate molecular complexity. Additionally, generating labeled data for supervised learning in the molecular sciences is often expensive and ethically constrained, which leads to small, diverse, and therefore challenging data sets. In this work, we develop a self-supervised learning approach to pretrain models on over 700 million unlabeled molecules drawn from multiple databases. The intrinsic chemical logic learned in this way enables the extraction of predictive representations from task-specific molecular sequences in a fine-tuning process. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models with different combinations of databases. Moreover, we propose a protocol based on data traits that automatically selects the optimal pretrained model for a given task. To validate the proposed method, we consider 10 benchmarks and 38 virtual screening data sets; extensive validation indicates that the method delivers superb performance.
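
The workflow the abstract outlines (self-supervised pretraining on unlabeled molecular sequences, then extraction of predictive representations for a downstream task) can be sketched in code. The following is a minimal PyTorch illustration, assuming a toy character-level SMILES vocabulary, a small transformer encoder, and a masked-token objective; every name, size, and hyperparameter here is an illustrative stand-in, not the authors' actual architecture.

```python
# Minimal sketch: self-supervised (masked-token) pretraining on unlabeled
# SMILES strings, then pooling hidden states into molecular representations.
# The vocabulary, model sizes, corpus, and loop length are all toy-scale
# assumptions standing in for the paper's ~700 million-molecule setup.
import torch
import torch.nn as nn

VOCAB = ["<pad>", "<mask>", "C", "c", "N", "O", "=", "(", ")", "1", "2"]
STOI = {t: i for i, t in enumerate(VOCAB)}
PAD_ID, MASK_ID = STOI["<pad>"], STOI["<mask>"]

class SmilesEncoder(nn.Module):
    """Small transformer encoder over SMILES tokens with a masked-LM head."""
    def __init__(self, vocab_size: int, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=PAD_ID)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # predicts masked tokens

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.encoder(self.embed(tokens))  # (batch, seq, d_model)

def encode(smiles: str, max_len: int = 16) -> torch.Tensor:
    """Character-level tokenization padded to a fixed length (toy scheme)."""
    ids = [STOI[ch] for ch in smiles if ch in STOI][:max_len]
    return torch.tensor(ids + [PAD_ID] * (max_len - len(ids)))

def mask_tokens(tokens: torch.Tensor, p: float = 0.15):
    """Corrupt a random fraction of non-pad tokens; the originals are targets."""
    corrupted = tokens.clone()
    mask = (torch.rand(tokens.shape) < p) & (tokens != PAD_ID)
    corrupted[mask] = MASK_ID
    return corrupted, mask

# A tiny unlabeled corpus standing in for the multi-database molecule pool.
batch = torch.stack([encode(s) for s in ["CCO", "c1ccccc1", "CC(=O)N", "CCN"]])

model = SmilesEncoder(len(VOCAB))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):  # self-supervised pretraining loop
    corrupted, mask = mask_tokens(batch)
    if not mask.any():   # skip the rare step where nothing was masked
        continue
    logits = model.lm_head(model(corrupted))
    loss = loss_fn(logits[mask], batch[mask])
    opt.zero_grad(); loss.backward(); opt.step()

# "Extracted representation": one pooled vector per molecule, which a small
# task-specific head could then be fine-tuned on with labeled data.
with torch.no_grad():
    reps = model(batch).mean(dim=1)
print(reps.shape)  # torch.Size([4, 64])
```

Mean pooling is just one simple way to collapse per-token states into a molecule-level vector; in a fine-tuning stage one would attach a task head (for example, a linear classifier) to these representations and train it, optionally together with the encoder, on the small labeled data set.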
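
The protocol for choosing among the three pretrained models is described only as being based on data traits, so the sketch below invents plausible traits (data set size, average sequence length, element diversity) and placeholder model names purely to show what an automatic, trait-driven selection rule could look like; none of these thresholds come from the paper.

```python
# Hypothetical data-trait-based model selection. The traits, thresholds, and
# model identifiers are illustrative guesses, not the paper's published rule.
from dataclasses import dataclass

@dataclass
class DataTraits:
    n_samples: int       # number of labeled molecules in the task data set
    mean_length: float   # average SMILES string length
    n_elements: int      # rough count of distinct element letters observed

def compute_traits(smiles_list: list[str]) -> DataTraits:
    # Character-level element counting is crude (e.g., "Cl" splits into "C"
    # and "l"), but it suffices for a sketch of trait extraction.
    letters = {ch for s in smiles_list for ch in s if ch.isalpha()}
    mean_len = sum(len(s) for s in smiles_list) / len(smiles_list)
    return DataTraits(len(smiles_list), mean_len, len(letters))

def select_pretrained_model(t: DataTraits) -> str:
    """Map task-data traits to one of three pretrained models (placeholder names)."""
    if t.n_samples < 1000 and t.n_elements <= 5:
        return "model_a"   # small, chemically narrow data sets
    if t.mean_length > 50:
        return "model_b"   # long, structurally complex molecules
    return "model_c"       # broad-coverage default

traits = compute_traits(["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"])
print(select_pretrained_model(traits))  # -> "model_a" for this tiny set
```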
