Article

Learned protein embeddings for machine learning

Journal

BIOINFORMATICS
Volume 34, Issue 15, Pages 2642-2648

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/bty178

Funding

  1. U.S. Army Research Office Institute for Collaborative Biotechnologies [W911F-09-0001]
  2. Donna and Benjamin M. Rosen Bioengineering Center
  3. National Institutes of Health [F31MH102913]
  4. National Science Foundation [GRF2017227007]

Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.

Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.

Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.

Contact: frances@cheme.caltech.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
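
As a rough illustration of the sequence-to-vector conversion mentioned in the abstract, the sketch below one-hot encodes a toy protein sequence. One-hot encoding is assumed here only as an example of a conventional representation; the abstract does not name the representations that were compared. The function name `one_hot_encode` and the toy sequence are hypothetical.

```python
# Minimal sketch: one-hot encoding of a protein sequence (an assumed example
# of a conventional representation, not the method from the paper).

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot_encode(sequence):
    """Return a flat one-hot vector of length 20 * len(sequence)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vector = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # raises KeyError for non-standard residues
        vector.extend(row)
    return vector


toy_sequence = "MKTAYIAKQR"  # hypothetical 10-residue sequence
encoded = one_hot_encode(toy_sequence)
print(len(encoded))  # 10 residues * 20 amino acids = 200 dimensions
```

The vector length grows with sequence length (20 entries per residue), so a protein of a few hundred residues already needs thousands of dimensions, in contrast with the low-dimensional embeddings the abstract describes.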
