4.8 Article

Learning meaningful representations of protein sequences

Journal

NATURE COMMUNICATIONS
Volume 13, Issue 1, Pages -

Publisher

NATURE PORTFOLIO
DOI: 10.1038/s41467-022-29443-w

Keywords

-

Funding

  1. Novo Nordisk Foundation through the MLLS Center [NNF18OC0052719]
  2. European Research Council (ERC) under the European Union [757360, 15334]
  3. VILLUM FONDEN [15334]
  4. NVIDIA Corporation
  5. Novo Nordisk Foundation [NNF18OC0052719]
  6. European Research Council (ERC) [757360] Funding Source: European Research Council (ERC)

This paper discusses the issue of representation in protein sequence analysis and proposes best practices for ensuring meaningful representations. The research finds that even minor modifications can result in different data representations and biological interpretations, raising the question of what constitutes the most meaningful representation.
Representation learning plays an increasing role in protein sequence analysis. This paper seeks to clarify how to ensure that such representations are meaningful, proposing best practices both for the choice of methods and the subsequent analysis.

How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This raises the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.
