Article

Automated Chemical Reaction Extraction from Scientific Literature

Journal

JOURNAL OF CHEMICAL INFORMATION AND MODELING
Volume 62, Issue 9, Pages 2035-2045

Publisher

AMER CHEMICAL SOC
DOI: 10.1021/acs.jcim.1c00284

Keywords

-

Funding

  1. DARPA Accelerated Molecular Discovery (AMD) program [HR00111920025]
  2. Machine Learning for Pharmaceutical Discovery and Synthesis Consortium (MLPDS)
  3. Defense Threat Reduction Agency [HDTRA12110013]

Abstract
Access to structured chemical reaction data is of key importance for chemists, both in performing bench experiments and in modern applications such as computer-aided drug design. Existing reaction databases are generally populated by human curators through manual abstraction from the published literature (e.g., patents and journals), which is time-consuming and labor-intensive, especially given the exponential growth of chemical literature in recent years. In this study, we focus on developing automated methods for extracting reactions from chemical literature. We consider journal publications as the target source of information; compared to patents, they are more comprehensive and better represent the latest developments in chemistry, but they are also less formulaic in their descriptions of reactions. To implement the reaction extraction system, we first devised a chemical reaction schema comprising a central product and a set of associated reaction roles such as reactants, catalysts, and solvents. We formulate the task as a structure prediction problem and solve it with a two-stage deep learning framework consisting of product extraction and reaction role labeling. Both models are built upon Transformer-based encoders, which are adaptively pretrained using domain- and task-relevant unlabeled data. Our models are shown to be both effective and data efficient, achieving an F1 score of 76.2% in product extraction and 78.7% in role extraction with only hundreds of annotated reactions.
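The two-stage structure of the framework can be illustrated with a minimal sketch. This is not the authors' model: the paper uses Transformer-based sequence taggers for both stages, whereas the stand-ins below use a toy lexicon lookup. The `Reaction` schema, the entity names, and the role lexicon are all hypothetical, chosen only to show how stage 2 (role labeling) is conditioned on each product found in stage 1 (product extraction).

```python
from dataclasses import dataclass, field

# Hypothetical schema: one central product plus labeled reaction roles
# (reactant, catalyst, solvent, ...), mirroring the paper's description.
@dataclass
class Reaction:
    product: str
    roles: dict = field(default_factory=dict)  # role name -> list of entities

def extract_products(sentence, known_products):
    # Stage 1 stand-in: the paper uses a Transformer tagger here;
    # we simply match a toy product lexicon against the sentence.
    return [p for p in known_products if p in sentence]

def label_roles(sentence, product, role_lexicon):
    # Stage 2 stand-in: assign a role to every other known entity,
    # conditioned on the product identified in stage 1.
    roles = {}
    for entity, role in role_lexicon.items():
        if entity in sentence and entity != product:
            roles.setdefault(role, []).append(entity)
    return Reaction(product=product, roles=roles)

# Toy input sentence and lexicon (illustrative, not from the paper).
sentence = ("Treatment of benzaldehyde with NaBH4 in methanol "
            "afforded benzyl alcohol.")
role_lexicon = {"benzaldehyde": "reactant", "NaBH4": "reactant",
                "methanol": "solvent"}
reactions = [label_roles(sentence, p, role_lexicon)
             for p in extract_products(sentence, ["benzyl alcohol"])]
```

Running the sketch produces one `Reaction` whose product is "benzyl alcohol", with "benzaldehyde" and "NaBH4" labeled as reactants and "methanol" as the solvent; the real system replaces both lookup functions with learned, adaptively pretrained encoders.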
