Article, Proceedings Paper

Techniques to cope with missing data in host-pathogen protein interaction prediction

Journal

BIOINFORMATICS
Volume 28, Issue 18, Pages I466-I472

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/bts375

Funding

  1. NIGMS NIH HHS [P50GM082251] Funding Source: Medline
  2. NLM NIH HHS [2R01LM007994-05] Funding Source: Medline

Motivation: Approaches that use supervised machine learning techniques for protein-protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host-pathogen PPI datasets have a large fraction of missing values, in the range of 58-85%, which makes it challenging to apply machine learning algorithms.

Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross-species information in combination with machine learning techniques such as group lasso with ℓ1/ℓ2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example, Salmonella-human PPI prediction, we obtain 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 in F1 score over the next best technique. We also apply our method successfully to Yersinia-human PPI prediction, demonstrating the generality of our approach.
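
As a rough illustration of two quantities named in the abstract, the sketch below computes an F1 score from the reported precision and recall and evaluates a generic l1/l2 (group lasso) penalty. It is not the authors' implementation: the function names, the toy weight vector, the feature grouping, and the `lam` parameter are all assumptions made for illustration.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def group_lasso_penalty(weights, groups, lam=1.0):
    """Generic l1/l2 penalty: lam times the sum, over groups, of the
    l2 norm of the weights in each group (hypothetical grouping)."""
    return lam * sum(
        math.sqrt(sum(weights[i] ** 2 for i in idx)) for idx in groups
    )


# Precision and recall as reported for Salmonella-human PPI prediction.
f1 = f1_score(0.776, 0.84)

# Toy example (assumed values): two feature groups, the second all zero,
# so only the first group contributes to the penalty.
penalty = group_lasso_penalty([3.0, 4.0, 0.0, 0.0], [[0, 1], [2, 3]])

print(f"F1 = {f1:.3f}, penalty = {penalty:.1f}")
```

Because the penalty sums unsquared group norms, it tends to shrink whole groups of weights to zero together, which is why it suits features that arrive in blocks from the same data source.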
