4.8 Article

Learning the molecular grammar of protein condensates from sequence determinants and embeddings

Publisher

NATL ACAD SCIENCES
DOI: 10.1073/pnas.2019053118

Keywords

liquid-liquid phase separation; biomolecular condensates; protein biophysics; machine learning; language models

Funding

  1. Schmidt Science Fellows program
  2. Rhodes Trust
  3. St John's College Junior Research Fellowship
  4. Trinity College Krishnan-Ang Studentship
  5. Honorary Trinity-Henry Barlow Scholarship
  6. Engineering and Physical Sciences Research Council (EPSRC) Centre for Doctoral Training in NanoScience and Nanotechnology (NanoDTC) [EP/L015978/1]
  7. EPSRC Impact Acceleration Program
  8. European Research Council under the European Union's Horizon 2020 Framework Program through the Marie Sklodowska-Curie Grant MicroSPARK [841466]
  9. Herchel Smith Fund of the University of Cambridge
  10. Wolfson College Junior Research Fellowship
  11. European Research Council under the European Union's Seventh Framework Program (FP7/2007-2013) through the European Research Council Grant PhysProt [337969]
  12. Newman Foundation

Research has shown that proteins prone to liquid-liquid phase separation are more disordered, less hydrophobic, and have lower Shannon entropy than other protein sequences. By using machine learning models and neural network language models, it is possible to predict and understand protein phase behavior effectively.
Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding on a global level the associations between protein sequence and phase behavior and further constructed machine learning models for predicting protein liquid-liquid phase separation (LLPS). Our analysis highlighted that LLPS-prone proteins are more disordered, less hydrophobic, and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database and that they show a fine balance in their relative content of polar and hydrophobic residues. To further learn in a hypothesis-free manner the sequence features underpinning LLPS, we trained a neural network-based language model and found that a classifier constructed on such embeddings learned the underlying principles of phase behavior at a comparable accuracy to a classifier that used knowledge-based features. By combining knowledge-based features with unsupervised embeddings, we generated an integrated model that distinguished LLPS-prone sequences both from structured proteins and from unstructured proteins with a lower LLPS propensity and further identified such sequences from the human proteome at a high accuracy. These results provide a platform rooted in molecular principles for understanding protein phase behavior. The predictor, termed DeePhase, is accessible from https://deephase.ch.cam.ac.uk/.
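
Two of the sequence features named in the abstract, Shannon entropy and hydrophobic content, can be illustrated with a minimal sketch. This is not the DeePhase implementation: the function names and the `HYDROPHOBIC` residue set are assumptions chosen for illustration, and the abstract does not state how the authors defined or computed either feature.

```python
import math
from collections import Counter

# Assumption: an illustrative set of hydrophobic residues (one-letter codes).
# The abstract does not say which residues the authors counted as hydrophobic.
HYDROPHOBIC = frozenset("AILMFVW")


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)
    counts = Counter(sequence)
    # H = sum over residues of p * log2(1 / p), with p = count / length
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def hydrophobic_fraction(sequence: str) -> float:
    """Fraction of residues that belong to the illustrative HYDROPHOBIC set."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(residue in HYDROPHOBIC for residue in sequence) / len(sequence)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteins.
    low_complexity = "GGSGGSGGSGGS"
    mixed = "MKVLAIDEFWRT"
    for name, seq in [("low_complexity", low_complexity), ("mixed", mixed)]:
        print(name, round(shannon_entropy(seq), 3), round(hydrophobic_fraction(seq), 3))
```

A sequence built from a few repeated residues gives a lower entropy than one that uses many residue types, which is the sense in which the abstract describes LLPS-prone proteins as having lower Shannon entropy.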
