Article

ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science

Journal

JOURNAL OF CHEMICAL INFORMATION AND MODELING
Volume 61, Issue 9, Pages 4280-4289

Publisher

AMER CHEMICAL SOC
DOI: 10.1021/acs.jcim.1c00446

Funding

  1. EPSRC Centre for Doctoral Training in Computational Methods for Materials Science [EP/L015552/1]
  2. BASF/Royal Academy of Engineering Research Chair in Data-Driven Molecular Engineering of Functional Materials
  3. Science and Technology Facilities Council (STFC) via the ISIS Neutron and Muon Source

The article introduces a framework for the automated population of ontologies, enabling direct extraction of a larger group of properties linked by a semantic network. It exploits data-rich sources, such as tables within documents, and presents a new model concept for extracting chemical and physical properties. Using automatically generated parsers and forward-looking interdependency resolution, the authors illustrate the power of the approach through the automatic extraction of a crystallographic hierarchy.
The ever-growing abundance of data found in heterogeneous sources, such as scientific publications, has forced the development of automated techniques for data extraction. While past work in the physical sciences has focused on the precise extraction of individual properties, attention has recently turned to the extraction of higher-level relationships. Here, we present a framework for the automated population of ontologies, that is, the direct extraction of a larger group of properties linked by a semantic network. We exploit data-rich sources, such as tables within documents, and present a new model concept that enables data extraction for chemical and physical properties with the ability to organize hierarchical data as nested information. Combining these capabilities with automatically generated parsers for data extraction and forward-looking interdependency resolution, we illustrate the power of our approach via the automatic extraction of a crystallographic hierarchy of information. This includes 18 interrelated submodels of nested data, extracted from an evaluation set of scientific articles, yielding an overall precision of 92.2% across 26 different journals. Our method and associated toolkit, ChemDataExtractor 2.0, offer a key step toward the seamless integration of primary literature sources into a data-driven scientific framework.
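
The idea of organizing hierarchical data as nested information can be illustrated with a minimal sketch in plain Python. This is not the ChemDataExtractor 2.0 API: the class names (`CrystalStructure`, `SpaceGroup`, `LatticeParameter`), their fields, and the `precision` helper are hypothetical and chosen only for illustration, and the NaCl values are textbook figures rather than data from the article. Precision is assumed here to follow the usual definition, true positives divided by all extracted records.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class LatticeParameter:
    """Hypothetical submodel: one lattice parameter with its unit."""
    name: str
    value: float
    units: str


@dataclass
class SpaceGroup:
    """Hypothetical submodel: space group symbol and number."""
    symbol: str
    number: Optional[int] = None


@dataclass
class CrystalStructure:
    """Hypothetical parent model that nests the submodels above."""
    compound: str
    space_group: Optional[SpaceGroup] = None
    lattice_parameters: List[LatticeParameter] = field(default_factory=list)


def precision(true_positives: int, false_positives: int) -> float:
    """Standard precision: correct extractions over all extractions."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("precision is undefined when nothing was extracted")
    return true_positives / total


# Illustrative record (textbook values for rock salt, not from the article).
record = CrystalStructure(
    compound="NaCl",
    space_group=SpaceGroup(symbol="Fm-3m", number=225),
    lattice_parameters=[LatticeParameter(name="a", value=5.64, units="angstrom")],
)

# asdict() recursively converts nested dataclasses into nested dictionaries.
nested = asdict(record)
```

In this sketch a single parent record carries its related properties as nested fields, so the whole hierarchy can be serialized or compared in one step instead of as unrelated individual properties.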
