Article; Proceedings Paper

An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling

SAR AND QSAR IN ENVIRONMENTAL RESEARCH (2016)

Journal

SAR AND QSAR IN ENVIRONMENTAL RESEARCH

Volume 27, Issue 11, Pages 911-937

Publisher

TAYLOR & FRANCIS LTD

DOI: 10.1080/1062936X.2016.1253611

Keywords

data curation; standardization; QSAR modelling; physicochemical properties; Open Data

Categories

Chemistry, Multidisciplinary; Computer Science, Interdisciplinary Applications; Environmental Sciences; Mathematical & Computational Biology; Toxicology

Abstract

The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
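As an illustration of the kind of identifier check such a curation step might include, the sketch below validates the check digit of a CASRN. This is an assumption for illustration only: the abstract does not say that the KNIME workflow performs this particular test. The function names and the record keys (`name`, `casrn`) are hypothetical. The check-digit rule itself is the standard one: weight the digits before the check digit by their position counted from the right, sum the products, and take the result modulo 10.

```python
import re

# A CASRN has the form NNNNNNN-NN-C: 2 to 7 digits, 2 digits, 1 check digit.
CASRN_PATTERN = re.compile(r"^([0-9]{2,7})-([0-9]{2})-([0-9])$")


def casrn_checksum_ok(casrn: str) -> bool:
    """Return True if the string is a well-formed CASRN with a valid check digit."""
    match = CASRN_PATTERN.match(casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight each digit by its position from the right, starting at 1.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


def flag_invalid_casrns(records):
    """Return the names of records whose CASRN fails the format or checksum test.

    `records` is a list of dicts with hypothetical keys "name" and "casrn".
    """
    return [r["name"] for r in records if not casrn_checksum_ok(r["casrn"])]


if __name__ == "__main__":
    example_records = [
        {"name": "water", "casrn": "7732-18-5"},
        {"name": "benzene (typo)", "casrn": "71-43-3"},  # correct CASRN is 71-43-2
    ]
    print(flag_invalid_casrns(example_records))
```

A checksum test of this kind only catches transcription errors within a single identifier. Detecting a mismatch between identifiers (for example, a CASRN and a SMILES string that describe different chemicals) requires comparing the structures they resolve to, which is outside the scope of this sketch.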
