☆ 4.7 Article

DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines

JOURNAL OF CHEMICAL INFORMATION AND MODELING (2019)

期刊

JOURNAL OF CHEMICAL INFORMATION AND MODELING

卷 59, 期 6, 页码 3057-3071

出版社

AMER CHEMICAL SOC

DOI: 10.1021/acs.jcim.8b00749

关键词

类别

Chemistry, Medicinal Chemistry, Multidisciplinary Computer Science, Information Systems Computer Science, Interdisciplinary Applications

资金

National Natural Science Foundation of China [61772273, 61373062, 61876072]
Fundamental Research Funds for the Central Universities [30918011104]
Natural Science Foundation of Anhui Province of China [KJ2018A0572]

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

摘要

Accurate identification of proteinDNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of proteinDNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of proteinDNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based proteinDNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. The computational experimental results showed that our predictor achieved an average overall accuracy of 91.7% and a Mathews correlation coefficient of 0.395 on five benchmark data sets and outperformed several state-of-the-art sequence-based proteinDNA binding site predictors.

DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines

期刊

JOURNAL OF CHEMICAL INFORMATION AND MODELING

出版社

AMER CHEMICAL SOC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines

期刊

JOURNAL OF CHEMICAL INFORMATION AND MODELING

出版社

AMER CHEMICAL SOC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文