Article

Machine-Learning-Guided Library Design Cycle for Directed Evolution of Enzymes: The Effects of Training Data Composition on Sequence Space Exploration

Journal

ACS CATALYSIS
Volume 11, Issue 23, Pages 14615-14624

Publisher

American Chemical Society
DOI: 10.1021/acscatal.1c03753

Keywords

machine learning; mutagenesis; protein engineering; directed evolution; library design; training data; sequence space exploration

Funding

  1. Cross-ministerial Strategic Innovation Promotion Program (SIP), Technologies for Smart Bioindustry and Agriculture (funding agency: Bio-oriented Technology Research Advancement Institution, National Agriculture and Food Research Organization (NARO), Japan)
  2. Japan Society for the Promotion of Science Research Fellowships [JP16H04570, JP16K14483, JP20H00315]
  3. Project "Development of the Key Technologies for the Next-Generation Artificial Intelligence/Robots" of the Ministry of Economy, Trade and Industry, Japan

The study shows that machine learning is a useful tool for designing proteins with desired functions. Depending on whether highly positive variants are present in the training data, machine-learning-guided directed evolution can yield improved variants in different regions of sequence space.
Machine learning (ML) is becoming an attractive tool in mutagenesis-based protein engineering because of its ability to design a variant library containing proteins with a desired function. However, it remains unclear how ML guides directed evolution in sequence space depending on the composition of the training data. Here, we present an ML-guided directed evolution study of an enzyme to investigate the effects of including a known highly positive variant (i.e., a variant known to have high enzyme activity) in the training data. We performed two separate series of ML-guided directed evolution of Sortase A, with and without a known highly positive variant called 5M in the training data. In each series, two rounds of ML were conducted: variants predicted in the first round were experimentally evaluated and used as additional training data for the second round of prediction. The improvements in enzyme activity were comparable between the two series, with both achieving enzyme activity 2.2 to 2.5 times higher than that of 5M. Intriguingly, the sequences of the improved variants differed largely between the two series, indicating that ML guided the directed evolution to distinct regions of sequence space depending on the presence or absence of the highly positive variant in the training data. This suggests that the sequence diversity of improved variants can be expanded not only by conventional ML using the whole training data but also by ML using a subset of the training data, even when that subset lacks highly positive variants. In summary, this study demonstrates the importance of regulating the composition of training data in ML-guided directed evolution.
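
The two-round cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' method: the abstract does not state which model, features, or library size were used, so everything here is an assumption made for illustration. The four-letter alphabet, the four mutated sites, the additive per-site model, and the `toy_assay` function (a hypothetical stand-in for the wet-lab activity measurement) are all invented. The sketch only shows the loop structure: fit on the current data, rank unseen variants, "measure" the top picks, add them to the training data, and repeat, once with the full training set and once with the best known variant held out.

```python
import itertools
import random

ALPHABET = "ACDE"  # toy residue alphabet (assumption; real proteins use 20)
N_SITES = 4        # number of mutated positions (assumption)

def toy_assay(seq):
    """Hypothetical stand-in for an experimental activity measurement."""
    weights = {"A": 0.0, "C": 0.5, "D": 1.0, "E": 1.5}
    return sum(weights[r] * (i + 1) for i, r in enumerate(seq))

def fit_additive_model(data):
    """Average the measured activity for each (site, residue) pair."""
    sums, counts = {}, {}
    for seq, activity in data:
        for key in enumerate(seq):  # key is (site index, residue)
            sums[key] = sums.get(key, 0.0) + activity
            counts[key] = counts.get(key, 0) + 1
    overall = sum(activity for _, activity in data) / len(data)
    table = {key: sums[key] / counts[key] for key in sums}
    return table, overall

def predict(model, seq):
    """Score a variant; unseen (site, residue) pairs fall back to the mean."""
    table, overall = model
    return sum(table.get(key, overall) for key in enumerate(seq)) / len(seq)

def design_cycle(training, all_seqs, n_rounds=2, batch=5):
    """Run n_rounds of fit -> rank -> measure -> augment."""
    data = list(training)
    for _ in range(n_rounds):
        model = fit_additive_model(data)
        seen = {seq for seq, _ in data}
        candidates = [s for s in all_seqs if s not in seen]
        candidates.sort(key=lambda s: predict(model, s), reverse=True)
        # "Experimentally evaluate" the top predictions and add them to the data.
        data.extend((s, toy_assay(s)) for s in candidates[:batch])
    return data

random.seed(0)
all_seqs = ["".join(p) for p in itertools.product(ALPHABET, repeat=N_SITES)]
full = [(s, toy_assay(s)) for s in random.sample(all_seqs, 20)]

# Series 1: full training data. Series 2: best known variant held out.
best_known = max(full, key=lambda d: d[1])
subset = [d for d in full if d != best_known]

full_result = design_cycle(full, all_seqs)
subset_result = design_cycle(subset, all_seqs)

new_full = {s for s, _ in full_result[len(full):]}
new_subset = {s for s, _ in subset_result[len(subset):]}
print("picked only with full data:", sorted(new_full - new_subset))
print("picked only with subset:   ", sorted(new_subset - new_full))
```

Comparing `new_full` and `new_subset` is a rough analogue of asking whether the two series land in different parts of sequence space; in this toy setting the outcome depends entirely on the invented assay and model, so it says nothing about the paper's results.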