Article; Proceedings Paper

Formator: Predicting Lysine Formylation Sites Based on the Most Distant Undersampling and Safe-Level Synthetic Minority Oversampling

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TCBB.2019.2957758

Keywords

Amino acids; Feature extraction; Training; Protein sequence; Support vector machines; Tools; Protein post-translational modification; lysine formylation site prediction; sequence analysis; ensemble learning; resampling techniques

Funding

  1. Fundamental Research Funds for the Central Universities [3132019175, 3132019323, 3132018230]
  2. National Natural Science Foundation of Liaoning Province [20180550307]
  3. National Scholarship Fund of China for Studying Abroad
  4. National Health and Medical Research Council of Australia (NHMRC) [APP490989, APP1127948, APP1144652]
  5. Australian Research Council (ARC) [LP110200333, DP120104460]
  6. National Institute of Allergy and Infectious Diseases of the National Institutes of Health [R01 AI111965]
  7. Major Inter-Disciplinary Research (IDR) project awarded by Monash University
  8. [CXXM2019SS022]


Formator is a novel predictor for identifying lysine formylation sites that combines an ensemble learning strategy with four feature extraction methods. It reaches an accuracy of 87.24% on the jackknife test and 74.96% on the independent test, and it outperforms the existing prediction tool LFPred on the independent test, indicating strong potential for identifying novel lysine formylation sites.
Lysine formylation is a reversible type of protein post-translational modification that is involved in a myriad of biological processes, including modulation of chromatin conformation and gene expression in histones and other nuclear proteins. Accurate identification of lysine formylation sites is essential for elucidating the underlying molecular mechanisms of formylation. Traditional experimental methods are time-consuming and expensive, so computational methods for accurate prediction of formylation sites are both desirable and necessary. In this study, we propose a novel predictor, termed Formator, for identifying lysine formylation sites from sequence information. Formator is developed using an ensemble learning (EL) strategy that combines four individual support vector machine classifiers via a voting system. To deal with the class imbalance of the training dataset, the most distant undersampling and Safe-Level-SMOTE oversampling techniques were integrated. Four effective feature extraction methods, namely bi-profile Bayes (BPB), k-nearest neighbor (KNN), amino acid physicochemical properties (AAindex), and composition and transition (CTD), were employed to encode the sequence features surrounding potential formylation sites. Extensive empirical studies show that Formator achieved accuracies of 87.24% and 74.96% on the jackknife test and the independent test, respectively. Performance comparison on the independent test indicates that Formator outperforms the existing prediction tool, LFPred, suggesting that it has great potential to serve as a useful tool for identifying novel lysine formylation sites and facilitating hypothesis-driven experimental efforts.
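
The voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the four classifiers are paired with the four encodings or how a 2-2 tie is resolved, so the one-encoding-per-classifier pairing, the `tie_label` parameter, and the stub classifiers below are all assumptions made for illustration. Each trained support vector machine is stood in for by a plain function that returns 1 (formylation site) or 0 (non-site).

```python
from typing import Callable, Sequence

# A "classifier" here is any function mapping a feature vector to 0 or 1.
# In practice each would be a trained SVM; stubs are used for illustration.
Classifier = Callable[[Sequence[float]], int]


def majority_vote(votes: Sequence[int], tie_label: int = 0) -> int:
    """Return the label chosen by most voters.

    With an even number of voters a tie is possible; `tie_label` is a
    hypothetical tie-breaking rule, not something stated in the abstract.
    """
    positives = sum(1 for v in votes if v == 1)
    negatives = len(votes) - positives
    if positives > negatives:
        return 1
    if negatives > positives:
        return 0
    return tie_label


def ensemble_predict(
    classifiers: Sequence[Classifier],
    encodings: Sequence[Sequence[float]],
    tie_label: int = 0,
) -> int:
    """Assumed pairing: classifier i votes on encoding i of the same site."""
    if len(classifiers) != len(encodings):
        raise ValueError("need exactly one encoding per classifier")
    votes = [clf(x) for clf, x in zip(classifiers, encodings)]
    return majority_vote(votes, tie_label)


def make_threshold_stub(threshold: float) -> Classifier:
    """Hypothetical stand-in for a trained SVM: thresholds the feature mean."""
    def predict(features: Sequence[float]) -> int:
        return 1 if sum(features) / len(features) > threshold else 0
    return predict


# Four stubs, one per encoding (BPB, KNN, AAindex, CTD in the paper's terms).
stubs = [make_threshold_stub(0.5) for _ in range(4)]
# Made-up feature vectors for a single candidate lysine site.
site_encodings = [[0.9, 0.8], [0.7, 0.6], [0.2, 0.1], [0.6, 0.9]]
prediction = ensemble_predict(stubs, site_encodings)
```

Here three of the four stubs vote positive, so the candidate site is labeled as a formylation site. A weighted or probability-averaging vote would be an equally plausible reading of "voting system"; the abstract alone does not settle which variant Formator uses.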
