Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning

Journal

CELL SYSTEMS
Volume 12, Issue 1, Pages 92-+

Publisher

CELL PRESS
DOI: 10.1016/j.cels.2020.10.007

Funding

  1. NIH [R35 GM119854, R01 GM131381]
  2. NSF [DMS-1811767]

This study uses a positive-unlabeled learning framework to infer sequence-function relationships from large-scale experimental data and reports excellent predictive performance. The estimated parameters pinpoint key residues that dictate protein structure and function, and the resulting model is applied to design highly stabilized enzymes.
Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. However, it is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.
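
To make the positive-unlabeled setting concrete, the following is a minimal sketch of a naive baseline, not the estimator developed in the paper. It one-hot encodes sequences and fits a logistic model that separates a positive set (sequences that passed a selection) from an unlabeled set (the starting library, whose members may or may not be functional). Everything here is an assumption made for illustration: the two-letter alphabet, the toy rule that a sequence is functional only if position 0 is 'A', and the helper names `one_hot`, `fit_logistic`, and `score`.

```python
import math
import random

ALPHABET = "AV"   # toy two-letter alphabet (assumption, for brevity)
SEQ_LEN = 4       # toy sequence length (assumption)

def one_hot(seq):
    """Flatten a sequence into 0/1 indicators, one per (position, letter) pair."""
    return [1.0 if aa == letter else 0.0 for aa in seq for letter in ALPHABET]

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi   # predicted prob minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def score(seq, w, b):
    """Linear score (log-odds of 'positive' vs 'unlabeled') for one sequence."""
    return b + sum(wj * xj for wj, xj in zip(w, one_hot(seq)))

def random_seq():
    return "".join(random.choice(ALPHABET) for _ in range(SEQ_LEN))

random.seed(0)
# Unlabeled set: the unselected library, a mix of functional and non-functional.
unlabeled = [random_seq() for _ in range(200)]
# Positive set: sequences that survived selection under the hypothetical toy rule.
positives = [s for s in (random_seq() for _ in range(400)) if s[0] == "A"]

# Naive baseline: label positives 1 and unlabeled 0. No true negatives are used.
X = [one_hot(s) for s in positives + unlabeled]
y = [1.0] * len(positives) + [0.0] * len(unlabeled)
w, b = fit_logistic(X, y)

# w[0] and w[1] are the weights for 'A' and 'V' at position 0.
print("position-0 effect (A minus V):", round(w[0] - w[1], 2))
```

Even though no sequence is ever labeled non-functional, the fitted weights single out position 0 as the site that matters, which is the kind of residue-level signal the abstract refers to. Treating unlabeled sequences as negatives biases the estimated probabilities, which is one reason dedicated PU methods exist.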
