Article

Large language models generate functional protein sequences across diverse families

Journal

NATURE BIOTECHNOLOGY
Volume 41, Issue 8, Pages 1099-+

Publisher

NATURE PORTFOLIO
DOI: 10.1038/s41587-022-01618-2

ProGen is a generative deep-learning model that designs artificial proteins with specific enzymatic activities. It was trained on 280 million protein sequences and can generate sequences with predictable functions. Fine-tuning ProGen on curated sequences improves controllable generation for specific protein families, and artificial lysozymes generated by ProGen showed catalytic efficiencies similar to those of natural lysozymes.
A generative deep-learning model designs artificial proteins with desired enzymatic activities. Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve controllable generation of proteins from families with sufficient homologous samples. Artificial proteins generated after fine-tuning on five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
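
The control-tag conditioning described in the abstract can be illustrated with a minimal sketch. This is not ProGen's actual tokenizer or tag vocabulary: the tag names and the `build_vocab` and `encode` helpers are hypothetical, chosen only to show how tags might be placed ahead of the amino-acid tokens so that a left-to-right language model reads them before it generates any residues.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one letter each


def build_vocab(control_tags: List[str]) -> Dict[str, int]:
    """Map control tags, residues, and an end marker to ids in one shared vocabulary."""
    tokens = [f"<{tag}>" for tag in control_tags] + list(AMINO_ACIDS) + ["<eos>"]
    return {token: index for index, token in enumerate(tokens)}


def encode(tags: List[str], sequence: str, vocab: Dict[str, int]) -> List[int]:
    """Prepend tag ids to residue ids so the tags come first in the model input."""
    ids = [vocab[f"<{tag}>"] for tag in tags]
    ids += [vocab[residue] for residue in sequence]
    ids.append(vocab["<eos>"])
    return ids


# Hypothetical tag names and a toy four-residue sequence, for illustration only.
vocab = build_vocab(["lysozyme", "chorismate_mutase"])
prompt_ids = encode(["lysozyme"], "MKAL", vocab)
print(prompt_ids)
```

In a setup like this, generation for a given family would start from the tag ids alone and let the model sample residues one at a time until it emits the end marker. How ProGen encodes its tags in practice is not specified in this abstract.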
