Article

BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers

Journal

ENERGY & FUELS
Volume 30, Issue 10, Pages 8410-8418

Publisher

AMER CHEMICAL SOC
DOI: 10.1021/acs.energyfuels.6b01952

Funding

  1. Bioenergy Technologies and Vehicle Technologies Offices, Office of Energy Efficiency and Renewable Energy (EERE), U.S. Department of Energy (DOE)
  2. National Nuclear Security Administration, DOE [DE-AC04-94AL85000]
  3. Office of Biological and Environmental Research, Office of Science, DOE [DE-AC02-05CH11231]
  4. Vehicle Technologies Office, DOE [DE-AC36-99GO10337]
  5. National Renewable Energy Laboratory

Screening a large number of biologically derived molecules for potential fuel compounds without recourse to experimental testing is important for identifying understudied yet valuable molecules. Experimental testing, although a valuable standard for measuring fuel properties, has several major limitations: it requires quantities large enough to test, considerable expense, and a large amount of time. This paper discusses the development of a general-purpose fuel property tool that uses machine learning to screen molecules for desirable fuel properties. BioCompoundML adopts a general methodology, requiring as input only a list of training compounds (with identifiers and measured values) and a list of testing compounds (with identifiers). For the training data, BioCompoundML collects open data from the National Center for Biotechnology Information, incorporates user-provided features, imputes missing values, performs feature reduction, builds a classifier, and clusters compounds. BioCompoundML then collects data for the testing compounds, predicts class membership, and determines whether each compound falls within the range of variability of the training data set. The tool is demonstrated using three different fuel properties: research octane number (RON), threshold soot index (TSI), and melting point (MP). We measure its success on these properties using randomized train/test splits: average accuracy is 88% for RON, 85% for TSI, and 94% for MP; average precision is 88% for RON, 88% for TSI, and 95% for MP; and average recall is 88% for RON, 82% for TSI, and 97% for MP. The area under the receiver operating characteristic curve was estimated at 0.88 for RON, 0.86 for TSI, and 0.87 for MP. We also measured the success of BioCompoundML by sending 16 compounds for direct RON determination. Finally, we provide a screen of 1977 hydrocarbons/oxygenates within the 8696 compounds in MetaCyc, identifying compounds with high predictive strength for high or low RON.
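
The accuracy, precision, recall, and area-under-the-curve figures quoted above are standard binary classification metrics. The sketch below shows one common way to compute them from held-out labels and classifier scores using only the Python standard library. It is an illustration, not the BioCompoundML code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example, and the abstract does not state how the tool itself computes these values.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all predictions that are correct."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true positives that the classifier recovers."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive outscores a randomly chosen negative (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("roc_auc needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): 1 = "high RON" class, 0 = "low RON" class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]   # hypothetical class-1 scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # assumed 0.5 threshold

print(accuracy(y_true, y_pred), precision(y_true, y_pred),
      recall(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy, precision, and recall depend on the chosen decision threshold, whereas the area under the curve summarizes ranking quality across all thresholds, which is why the two kinds of figures can differ for the same property.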
