Article

Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework

BRIEFINGS IN BIOINFORMATICS (2019)

Journal

BRIEFINGS IN BIOINFORMATICS

Volume 20, Issue 6, Pages 2185-2199

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/bib/bby079

Keywords

lysine malonylation; computational prediction; feature encoding methods; machine learning; ensemble learning; Light Gradient Boosting Machine

Categories

Biochemical Research Methods; Mathematical & Computational Biology

Funding

Natural Science Foundation of Guangxi [2016GXNSFCA380005]
Innovation Project of Guilin University of Electronic Technology Graduate Education [2018YJCX49]
Australian Research Council (ARC) [LP110200333, DP120104460]
National Institute of Allergy and Infectious Diseases of the National Institutes of Health [R01 AI111965]
Monash University
Discovery Outstanding Research Award of the ARC [DP140100087]
Informatics Institute of the School of Medicine at University of Alabama at Birmingham

Abstract

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
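
The pipeline the abstract describes (encode the residues around each candidate lysine, score them with several models, combine the scores, and evaluate by AUC) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the flank length of 12, the `X` padding symbol, the choice of binary (one-hot) encoding as the example feature, the unweighted score averaging used as the ensemble rule, and all function names.

```python
from typing import List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"  # hypothetical padding symbol for positions past the termini


def lysine_windows(seq: str, flank: int = 12) -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every lysine (K) in seq.

    Each window holds `flank` residues on either side of the lysine;
    the flank length is an illustrative assumption.
    """
    padded = PAD * flank + seq + PAD * flank
    return [
        (i + 1, padded[i : i + 2 * flank + 1])
        for i, aa in enumerate(seq)
        if aa == "K"
    ]


def binary_encode(window: str) -> List[int]:
    """One-hot encode a window: 20 indicator values per residue.

    The padding symbol matches no residue, so it becomes 20 zeros.
    """
    vec: List[int] = []
    for aa in window:
        vec.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vec


def average_ensemble(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-model scores by unweighted averaging (an assumed rule)."""
    n_models = len(score_lists)
    return [sum(col) / n_models for col in zip(*score_lists)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive site scores higher; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In practice, the vectors from `binary_encode` would be fed to trained classifiers such as those named in the abstract, and `average_ensemble` would receive one list of predicted probabilities per model. The pairwise `auc` is quadratic in the number of sites, which is acceptable for a sketch but not for large data sets.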
