Journal
BRIEFINGS IN BIOINFORMATICS
Volume 22, Issue 4, Pages -
Publisher
OXFORD UNIV PRESS
DOI: 10.1093/bib/bbaa321
Keywords
QSAR; machine learning; XGBoost; support vector machine; ensemble learning
Funding
- Key R&D Program of Zhejiang Province [2020C03010]
- National Natural Science Foundation of China [21575128, 81773632]
- Leading Talent of 'Ten Thousand Plan'-National High-Level Talents Special Support Plan
- Zhejiang Provincial Natural Science Foundation of China [LZ19H300001]
A study on learning QSAR models using various ML algorithms for 14 public datasets showed that rbf-SVM, rbf-GPR, XGBoost, and DNN generally perform better than other algorithms. SVM and XGBoost are recommended for regression learning on small datasets, while XGBoost is an excellent choice for large datasets. Ensemble models integrating multiple algorithms can improve prediction accuracy.
Although a wide variety of machine learning (ML) algorithms have been utilized to learn quantitative structure-activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. Therefore, a comprehensive understanding of the performance characteristics of popular ML algorithms used in QSAR learning is highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multiple adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were employed to learn regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally perform better than the other algorithms. The overall performances of the algorithms can be ranked from best to worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performance of ensemble models built by integrating the predictions of multiple ML algorithms.
The results illustrate that ensembles of two or three algorithms from different categories can indeed improve on the predictions of the best individual ML algorithms.
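As a minimal sketch of the consensus-ensemble idea described in the abstract, the snippet below averages the predictions of two individual regressors. The two regressors here (a 1-nearest-neighbor model and a training-mean baseline) are simple stand-ins chosen to keep the example dependency-free; the study itself combined algorithms such as rbf-SVM and XGBoost, and the averaging rule is an assumption for illustration, not the paper's exact ensembling scheme.

```python
def knn_predict(train_X, train_y, x, k=1):
    """Predict by averaging the targets of the k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_predict(train_y):
    """Trivial baseline regressor: always predict the training-set mean."""
    return sum(train_y) / len(train_y)

def ensemble_predict(train_X, train_y, x):
    """Consensus ensemble: average the two individual predictions."""
    return 0.5 * (knn_predict(train_X, train_y, x) + mean_predict(train_y))

# Toy 1-D regression data (hypothetical, for demonstration only).
X = [0.0, 1.0, 2.0, 3.0]
y = [0.1, 1.1, 1.9, 3.2]
print(round(ensemble_predict(X, y, 2.1), 4))  # → 1.7375
```

In practice the ensemble members would be trained models from different algorithm categories (e.g. an analogizer and a symbolist), and the combination could be a weighted rather than a plain average.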