A Cascade Random Forests Algorithm for Predicting Protein-Protein Interaction Sites

IEEE TRANSACTIONS ON NANOBIOSCIENCE (2015)

Journal

IEEE TRANSACTIONS ON NANOBIOSCIENCE

Volume 14, Issue 7, Pages 746-760

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TNB.2015.2475359

Keywords

Cascade random forests; imbalanced learning; protein-protein interaction sites; random forests; sequence-based predictor

Funding

National Natural Science Foundation of China [61373062, 61233011, 61222306]
Jiangsu Postdoctoral Science Foundation [1201027C]
Natural Science Foundation of Jiangsu [BK20141403]
China Postdoctoral Science Foundation [2014T70526, 2013M530260]
Fundamental Research Funds for Central Universities [30920130111010]
The Six Top Talents of Jiangsu Province [2013-XXRJ-022]

Abstract

Protein-protein interactions exist ubiquitously and play important roles in the life cycles of living cells. The interaction sites (residues) are essential to understanding the underlying mechanisms of protein-protein interactions. Previous research has demonstrated that the accurate identification of protein-protein interaction sites (PPIs) is helpful for developing new therapeutic drugs because many drugs will interact directly with those residues. Because of its significant potential in biological research and drug development, the prediction of PPIs has become an important topic in computational biology. However, a severe data imbalance exists in the PPIs prediction problem, where the number of the majority class samples (non-interacting residues) is far larger than that of the minority class samples (interacting residues). Thus, we developed a novel cascade random forests algorithm (CRF) to address the serious data imbalance that exists in the PPIs prediction problem. The proposed CRF resolves the negative effect of data imbalance by connecting multiple random forests in a cascade-like manner, each of which is trained with a balanced training subset that includes all minority samples and a subset of majority samples using an effective ensemble protocol. Based on the proposed CRF, we implemented a new sequence-based PPIs predictor, called CRF-PPI, which takes the combined features of position-specific scoring matrices, averaged cumulative hydropathy, and predicted relative solvent accessibility as model inputs. Benchmark experiments on both the cross validation and independent validation datasets demonstrated that the proposed CRF-PPI outperformed the state-of-the-art sequence-based PPIs predictors. The source code for CRF-PPI and the benchmark datasets are available online at http://csbio.njust.edu.cn/bioinf/CRF-PPI for free academic use.
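
The balanced-subset idea in the abstract can be illustrated with a minimal sketch. The abstract states only that each random forest is trained on a subset containing all minority samples plus a subset of majority samples; it does not say how the majority samples are chosen or how the forests are connected in the cascade. The version below therefore assumes random sampling without replacement and an equal number of majority and minority samples per subset, and it stops at building the index sets. The function name `balanced_subsets` and its parameters are hypothetical and are not taken from the CRF-PPI source code.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Build index lists for training one classifier per subset.

    Assumptions (not specified in the abstract): label 1 marks the
    minority class (interacting residues), label 0 the majority class,
    and each subset draws as many majority samples as there are
    minority samples, at random and without replacement.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if not minority or len(majority) < len(minority):
        raise ValueError("need a non-empty minority class smaller than the majority class")

    subsets = []
    for _ in range(n_subsets):
        # Every subset keeps all minority samples and a fresh majority draw.
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append(sorted(minority + sampled_majority))
    return subsets


# Toy data: 20 interacting residues among 200 (a 1:9 imbalance).
labels = [1 if i % 10 == 0 else 0 for i in range(200)]
subsets = balanced_subsets(labels, n_subsets=5)
```

Each index list would then be used to train one random forest, so that no single forest sees the full imbalance while the minority class is never discarded.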
