Article

Revisiting the negative example sampling problem for predicting protein-protein interactions

Journal

BIOINFORMATICS
Volume 27, Issue 21, Pages 3024-3028

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btr514

Funding

  1. National Institutes of Health [GM067779, GM088624]
  2. Welch Foundation [F1515]
  3. Packard Foundation
  4. U.S. Army Research [58343-MA]
  5. Deutsche Forschungsgemeinschaft (DFG-Forschungsstipendium)

Motivation: A number of computational methods have been proposed that predict protein-protein interactions (PPIs) based on protein sequence features. Since the number of potential non-interacting protein pairs (negative PPIs) is very high both in absolute terms and in comparison to that of interacting protein pairs (positive PPIs), computational prediction methods rely upon subsets of negative PPIs for training and validation. Hence, the need arises for subset sampling for negative PPIs.

Results: We clarify that there are two fundamentally different types of subset sampling for negative PPIs. One is subset sampling for cross-validated testing, where one desires unbiased subsets so that predictive performance estimated with them can be safely assumed to generalize to the population level. The other is subset sampling for training, where one desires the subsets that best train predictive algorithms, even if these subsets are biased. We show that confusion between these two fundamentally different types of subset sampling led one study recently published in Bioinformatics to the erroneous conclusion that predictive algorithms based on protein sequence features are hardly better than random in predicting PPIs. Rather, both protein sequence features and the 'hubbiness' of interacting proteins contribute to effective prediction of PPIs. We provide guidance for appropriate use of random versus balanced sampling.
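The difference between random and balanced sampling of negative PPIs can be made concrete with a toy network. The sketch below is illustrative only and is not the procedure used in the paper: the function names `random_negatives` and `balanced_negatives`, the toy protein labels, and the reading of "balanced" as "each protein appears in the negative set as often as in the positive set" are all assumptions. The balanced sampler uses simple rejection sampling over shuffled protein "stubs", which is practical only for small examples.

```python
import random
from collections import Counter
from itertools import combinations


def random_negatives(proteins, positives, n, rng):
    """Draw n non-interacting pairs uniformly from all candidate pairs."""
    positive_set = {frozenset(p) for p in positives}
    candidates = [
        pair for pair in combinations(sorted(proteins), 2)
        if frozenset(pair) not in positive_set
    ]
    return rng.sample(candidates, n)


def balanced_negatives(positives, rng, max_tries=1000):
    """Draw negatives in which each protein occurs as often as in the positives.

    Assumed definition of "balanced"; implemented by shuffling one stub per
    protein occurrence, pairing neighbours, and rejecting invalid draws.
    """
    positive_set = {frozenset(p) for p in positives}
    stubs = [protein for pair in positives for protein in pair]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        pairs = [frozenset(stubs[i:i + 2]) for i in range(0, len(stubs), 2)]
        no_self_or_positive = all(
            len(p) == 2 and p not in positive_set for p in pairs
        )
        if no_self_or_positive and len(set(pairs)) == len(pairs):
            return [tuple(sorted(p)) for p in pairs]
    raise RuntimeError("no balanced negative set found within max_tries")


# Hypothetical toy network: "H" is a hub with three partners.
positives = [("H", "A"), ("H", "B"), ("H", "C"),
             ("D", "E"), ("F", "G"), ("D", "F")]
proteins = {protein for pair in positives for protein in pair}

rng = random.Random(0)
rand_neg = random_negatives(proteins, positives, len(positives), rng)
bal_neg = balanced_negatives(positives, rng)

# Compare how often each protein occurs in each negative set.
print("positives:", Counter(x for p in positives for x in p))
print("random   :", Counter(x for p in rand_neg for x in p))
print("balanced :", Counter(x for p in bal_neg for x in p))
```

In the balanced set the hub occurs as often among negatives as among positives, so a classifier cannot exploit "hubbiness" alone. In the random set the hub's frequency reflects only its share of all non-interacting pairs, which is the property the abstract associates with unbiased testing.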
