Article

Transformer-based protein generation with regularized latent space optimization

Journal

NATURE MACHINE INTELLIGENCE
Volume 4, Issue 10, Pages 840-851

Publisher

NATURE PORTFOLIO
DOI: 10.1038/s42256-022-00532-1

Funding

  1. National Library of Medicine training grant [LM007056]
  2. Yale-Boehringer Ingelheim Biomedical Data Science Fellowship
  3. NIGMS [R01GM135929, R01GM130847]
  4. NSF Career Grant [2047856]
  5. Chan-Zuckerberg Initiative Grants [CZF2019-182702, CZF2019-002440]
  6. Sloan Fellowship [FG-2021-15883]
  7. NSF Directorate for Computer & Information Science & Engineering, Division of Information & Intelligent Systems [2047856] Funding Source: National Science Foundation

Abstract

In this study, Castro and colleagues propose a smooth and pseudoconvex latent space for easier navigation and more efficient optimization of proteins. Using the ReLSO deep transformer-based autoencoder, they explicitly model the sequence-function landscape and generate new molecules by optimizing within the latent space. ReLSO outperforms other approaches in terms of sequence optimization efficiency and robust generation of high-fitness sequences.
The space of possible proteins is vast, and optimizing proteins for specific target properties computationally is an ongoing challenge, even with large amounts of data. Castro and colleagues combine a transformer-based model with regularized prediction heads to form a smooth and pseudoconvex latent space that allows for easier navigation and more efficient optimization of proteins. The development of powerful natural language models has improved the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution and next-generation sequencing have allowed for the accumulation of large amounts of labelled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder, which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labelled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and green fluorescent protein. We observe greater sequence-optimization efficiency (increase in fitness per optimization step) with ReLSO than with other approaches, and ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information.
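The core optimization idea in the abstract, ascending the gradient of a learned fitness predictor within a smooth, pseudoconvex latent space, can be sketched with a toy example. The quadratic surrogate below is only a stand-in for ReLSO's trained fitness head (which in the paper is a neural regressor over a transformer autoencoder's latent codes); the names `fitness`, `grad_fitness`, and `optimize_latent` are illustrative, not from the paper.

```python
import numpy as np

def fitness(z, z_opt):
    """Toy pseudoconvex fitness surrogate: a single peak at z_opt.

    Stands in for ReLSO's learned fitness-prediction head; a pseudoconvex
    landscape guarantees that following the gradient reaches the peak.
    """
    return -np.sum((z - z_opt) ** 2)

def grad_fitness(z, z_opt):
    """Analytic gradient of the toy surrogate (autograd in practice)."""
    return -2.0 * (z - z_opt)

def optimize_latent(z0, z_opt, lr=0.1, steps=50):
    """Gradient ascent on predicted fitness, moving the latent code z."""
    z = z0.copy()
    for _ in range(steps):
        z = z + lr * grad_fitness(z, z_opt)  # step uphill in fitness
    return z

# Start from an arbitrary latent code and optimize toward higher fitness;
# in ReLSO the optimized code would then be decoded back into a sequence.
z_opt = np.array([1.0, -0.5])   # hypothetical fitness peak in latent space
z_start = np.zeros(2)
z_final = optimize_latent(z_start, z_opt)
```

In the actual method, the gradient comes from backpropagating through the jointly trained fitness head, and regularization keeps the latent landscape smooth enough for this traversal to be reliable.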

