4.6 Article

Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases

Journal

JOURNAL OF BIOLOGICAL CHEMISTRY
Volume 297, Issue 2, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.jbc.2021.100931

Funding

  1. National Science Foundation [CBET-1552355]
  2. U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Bioenergy Technologies Office
  3. NSF

GH7 family glycoside hydrolases are key enzymes in cellulose degradation, including cellobiohydrolases and endoglucanases. Machine-learning models show that the number of residues in the active-site loops strongly correlates with functional subtype across the GH7 family, with specific residues at certain sequence positions accurately predicting the subtype. Additionally, a random forest model trained on residues in the catalytic domain can predict the presence of a carbohydrate-binding module with high accuracy.
Family 7 glycoside hydrolases (GH7) are among the principal enzymes for cellulose degradation in nature and in industry. These enzymes are often bimodular, comprising a catalytic domain and a carbohydrate-binding module (CBM) attached via a flexible linker, and exhibit an active site that binds cello-oligomers of up to ten glucosyl moieties. GH7 cellulases consist of two major subtypes: cellobiohydrolases (CBH) and endoglucanases (EG). Despite the critical importance of GH7 enzymes, there remain gaps in our understanding of how GH7 sequence and structure relate to function. Here, we employed machine learning to gain data-driven insights into relationships between sequence, structure, and function across the GH7 family. Machine-learning models, trained only on the number of residues in the active-site loops as features, were able to discriminate GH7 CBHs and EGs with up to 99% accuracy, demonstrating that the lengths of loops A4, B2, B3, and B4 strongly correlate with functional subtype across the GH7 family. Classification rules were derived such that specific residues at 42 different sequence positions each predicted the functional subtype with accuracies surpassing 87%. A random forest model trained on residues at 19 positions in the catalytic domain predicted the presence of a CBM with 89.5% accuracy. Our machine learning results recapitulate, as top-performing features, a substantial number of the sequence positions determined by previous experimental studies to play vital roles in GH7 activity. We surmise that the yet-to-be-explored sequence positions among the top-performing features also contribute to GH7 functional variation and may be exploited to understand and manipulate function.
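The idea of a single-position classification rule can be illustrated with a small sketch. It is not the authors' pipeline: the aligned sequences below are made up for illustration (they are not GH7 data), the function names `best_rule_at_position` and `rank_positions` are hypothetical, and accuracy is computed on the same toy set rather than on held-out sequences. For each alignment column, the sketch looks for the rule of the form "residue X at this position implies subtype S, any other residue implies the other subtype" that labels the most sequences correctly, then ranks the columns by that accuracy.

```python
# Toy, made-up aligned sequences (NOT real GH7 data), each paired with a
# functional subtype label. '-' marks an alignment gap.
toy_alignment = [
    ("WYAGC", "CBH"),
    ("WYSGC", "CBH"),
    ("WYAGT", "CBH"),
    ("W-AGC", "EG"),
    ("W-SGT", "EG"),
    ("WYSGT", "EG"),
]


def best_rule_at_position(alignment, pos):
    """Return (residue, subtype, accuracy) for the best rule of the form
    'residue at pos -> subtype, any other residue -> the other subtype'."""
    subtypes = sorted({label for _, label in alignment})
    assert len(subtypes) == 2, "this sketch assumes exactly two subtypes"
    best = None
    for residue in sorted({seq[pos] for seq, _ in alignment}):
        for positive in subtypes:
            negative = subtypes[0] if positive == subtypes[1] else subtypes[1]
            # Count sequences whose predicted subtype matches the true label.
            correct = sum(
                (positive if seq[pos] == residue else negative) == label
                for seq, label in alignment
            )
            accuracy = correct / len(alignment)
            if best is None or accuracy > best[2]:
                best = (residue, positive, accuracy)
    return best


def rank_positions(alignment):
    """Rank alignment columns by the accuracy of their best single-residue rule.

    Each entry is (position, residue, subtype, accuracy), best first."""
    length = len(alignment[0][0])
    rules = [(pos,) + best_rule_at_position(alignment, pos) for pos in range(length)]
    return sorted(rules, key=lambda rule: -rule[3])


if __name__ == "__main__":
    for pos, residue, subtype, accuracy in rank_positions(toy_alignment):
        print(f"position {pos}: '{residue}' -> {subtype} (accuracy {accuracy:.2f})")
```

On this toy set, the column containing the gap comes out on top, loosely mirroring the abstract's observation that loop length (residues present or absent) tracks functional subtype. A realistic analysis would start from a curated multiple sequence alignment and estimate accuracy with cross-validation.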
