Article

Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration

Journal

RESEARCH
Volume 2022, Issue -, Pages -

Publisher

AMER ASSOC ADVANCEMENT SCIENCE
DOI: 10.34133/research.0004

Keywords

-

Funding

  1. High-Performance Computing Center of Central South University
  2. National Key Research and Development Program of China
  3. National Natural Science Foundation of China
  4. Hunan Provincial Science Fund for Distinguished Young Scholars
  5. Science and Technology Innovation
  6. [2021YFF1201400]
  7. [U1811462]
  8. [22173118]
  9. [2021JJ10068]


Accurate prediction of the pharmacological properties of small molecules is becoming increasingly important in drug discovery. Traditional feature-engineering approaches rely heavily on handcrafted descriptors and/or fingerprints, which require extensive human expert knowledge. With the rapid progress of artificial intelligence technology, data-driven deep learning methods have shown unparalleled advantages over feature-engineering-based methods. However, existing deep learning methods usually suffer from the scarcity of labeled data and the inability to share information between different tasks when applied to predicting molecular properties, resulting in poor generalization capability. Here, we propose a novel multitask learning BERT (Bidirectional Encoder Representations from Transformers) framework, named MTL-BERT, which leverages large-scale pretraining, multitask learning, and SMILES (simplified molecular-input line-entry system) enumeration to alleviate the data scarcity problem. MTL-BERT first exploits a large amount of unlabeled data through self-supervised pretraining to mine the rich contextual information in SMILES strings, and then fine-tunes the pretrained model for multiple downstream tasks simultaneously by leveraging their shared information. Meanwhile, SMILES enumeration is used as a data augmentation strategy during the pretraining, fine-tuning, and test phases to substantially increase data diversity and help the model learn the key relevant patterns from complex SMILES strings. The experimental results show that the pretrained MTL-BERT model, with little additional fine-tuning, achieves much better performance than state-of-the-art methods on most of the 60 practical molecular datasets. Additionally, the MTL-BERT model leverages attention mechanisms to focus on the SMILES character features essential to target properties, providing model interpretability.
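The SMILES enumeration strategy described in the abstract rests on the fact that one molecule can be written as many equivalent SMILES strings: randomly renumbering the molecule's atoms and emitting non-canonical SMILES produces distinct but chemically identical strings, which serve as augmented training samples. Below is a minimal sketch of this general technique using RDKit; the helper name enumerate_smiles and the sampling parameters are illustrative assumptions, not the authors' implementation.

```python
# Sketch of SMILES enumeration via random atom renumbering (RDKit).
# Illustrative only -- not the MTL-BERT authors' code.
from rdkit import Chem
import random

def enumerate_smiles(smiles: str, n_variants: int = 10) -> list:
    """Return up to n_variants distinct non-canonical SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    variants = set()
    # Oversample random atom orderings; duplicate strings collapse in the set.
    for _ in range(n_variants * 10):
        order = list(range(mol.GetNumAtoms()))
        random.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
        if len(variants) >= n_variants:
            break
    return sorted(variants)

# Example: several equivalent renderings of aspirin.
print(enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O", n_variants=5))
```

In a pipeline like the one the abstract describes, such enumerated strings would be fed to the model at pretraining, fine-tuning, and test time, increasing data diversity without adding new labels.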
