Article

A Cascade Random Forests Algorithm for Predicting Protein-Protein Interaction Sites

Journal

IEEE TRANSACTIONS ON NANOBIOSCIENCE
Volume 14, Issue 7, Pages 746-760

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TNB.2015.2475359

Keywords

Cascade random forests; imbalanced learning; protein-protein interaction sites; random forests; sequence-based predictor

Funding

  1. National Natural Science Foundation of China [61373062, 61233011, 61222306]
  2. Jiangsu Postdoctoral Science Foundation [1201027C]
  3. Natural Science Foundation of Jiangsu [BK20141403]
  4. China Postdoctoral Science Foundation [2014T70526, 2013M530260]
  5. Fundamental Research Funds for Central Universities [30920130111010]
  6. The Six Top Talents of Jiangsu Province [2013-XXRJ-022]

Abstract

Protein-protein interactions are ubiquitous and play important roles in the life cycles of living cells. The interaction sites (residues) are essential to understanding the underlying mechanisms of protein-protein interactions. Previous research has demonstrated that accurate identification of protein-protein interaction sites (PPIs) is helpful for developing new therapeutic drugs, because many drugs interact directly with those residues. Because of its significant potential in biological research and drug development, the prediction of PPIs has become an important topic in computational biology. However, the PPIs prediction problem suffers from severe data imbalance: the number of majority-class samples (non-interacting residues) is far larger than the number of minority-class samples (interacting residues). To address this imbalance, we developed a novel cascade random forests (CRF) algorithm. The proposed CRF mitigates the negative effect of data imbalance by connecting multiple random forests in a cascade-like manner using an effective ensemble protocol; each forest is trained on a balanced training subset that includes all minority samples and a subset of the majority samples. Based on the proposed CRF, we implemented a new sequence-based PPIs predictor, called CRF-PPI, which takes the combined features of position-specific scoring matrices, averaged cumulative hydropathy, and predicted relative solvent accessibility as model inputs. Benchmark experiments on both cross-validation and independent validation datasets demonstrated that the proposed CRF-PPI outperformed state-of-the-art sequence-based PPIs predictors. The source code for CRF-PPI and the benchmark datasets are available online at http://csbio.njust.edu.cn/bioinf/CRF-PPI for free academic use.
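The balanced-subset idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how consecutive stages are connected, so the filtering rule below (drop majority samples that a stage already scores confidently as non-interacting) is an assumption, as are all names (`balanced_subset`, `train_cascade`, `predict_proba`, `drop_below`) and the choice to average stage scores. The base learner is passed in as a hypothetical `fit` callable so that the sketch depends only on the standard library.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (feature vector, label); label 1 = interacting residue (minority class)
Sample = Tuple[Sequence[float], int]
# Maps a feature vector to an estimated probability of the minority class
Scorer = Callable[[Sequence[float]], float]


def balanced_subset(minority: List[Sample], majority: List[Sample],
                    rng: random.Random) -> List[Sample]:
    """Return all minority samples plus an equally sized random draw
    (without replacement) from the majority samples."""
    k = min(len(minority), len(majority))
    return list(minority) + rng.sample(list(majority), k)


def train_cascade(samples: List[Sample],
                  fit: Callable[[List[Sample]], Scorer],
                  n_stages: int = 3,
                  drop_below: float = 0.2,
                  seed: int = 0) -> List[Scorer]:
    """Train up to n_stages learners, each on a balanced subset.

    Assumed cascade rule (not taken from the paper): after each stage,
    majority samples scored below `drop_below` are removed, so later
    stages see only the harder non-interacting residues.
    """
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == 1]
    majority = [s for s in samples if s[1] == 0]
    if not minority or not majority:
        raise ValueError("both classes must be present")

    stages: List[Scorer] = []
    for _ in range(n_stages):
        if not majority:  # every majority sample has been filtered out
            break
        subset = balanced_subset(minority, majority, rng)
        rng.shuffle(subset)
        scorer = fit(subset)
        stages.append(scorer)
        majority = [s for s in majority if scorer(s[0]) >= drop_below]
    return stages


def predict_proba(stages: List[Scorer], x: Sequence[float]) -> float:
    """Average the stage scores (one possible ensemble rule)."""
    return sum(stage(x) for stage in stages) / len(stages)
```

In practice `fit` would wrap a random forest, for example a function that trains scikit-learn's `RandomForestClassifier` on the subset and returns a closure around its `predict_proba`. The sketch leaves that choice open on purpose: the point is only that every stage sees all interacting residues alongside an equally sized sample of non-interacting ones.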
