4.6 Article

Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors

Journal

JOURNAL OF BIOMEDICAL INFORMATICS
Volume 128, Issue -, Pages -

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.jbi.2022.104016

Keywords

Digital Signal Processing; Directed Evolution; Machine Learning; Protein Spectra; Physiochemical Descriptors

Funding

  1. Google's Google Cloud Academic Research Grant under Google Cloud COVID-19

Accurately predicting the effects of protein mutations is a key focus in protein engineering. This study evaluates encoding strategies for protein sequences using the Amino Acid Index database. By transforming the indices into their spectral form and combining them with protein structural and physiochemical descriptors, as well as using the Partial Least Squares Regression algorithm, predictive models with improved quality were built. The findings highlight the utility of this encoding strategy in identifying the Sequence-Activity-Relationship (SAR).
Accurately establishing the connection between a protein sequence and its function remains a focal point of protein engineering, especially for predicting the effects of mutations. This has driven continued efforts to build accurate and reliable machine learning models that allow virtual screening of many protein mutant sequences by measuring the relationship between sequence and 'fitness' or 'activity', commonly known as a Sequence-Activity-Relationship (SAR). An important preliminary stage in building these predictive models is the encoding of the chosen sequences. This work evaluates a range of encoding strategies using the Amino Acid Index database, where the indices are transformed into their spectral form via Digital Signal Processing (DSP) techniques, as well as numerous protein structural and physiochemical descriptors. The encoding strategies are explored on a dataset curated to measure the thermostability of various mutants from a recombination library designed from parental cytochrome P450s. Protein spectra concatenated with protein descriptors, together with the Partial Least Squares Regression (PLS) algorithm, gave the most noteworthy increase in the quality of the predictive models (as described in Encoding Strategy C), highlighting their utility in identifying an SAR. The accompanying software, pySAR (Python Sequence-Activity Relationship), allows a user to find the optimal arrangement of structural and/or physiochemical properties to encode their specific mutant library dataset; the source code is available at: https://github.com/amckenna41/pySAR.
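
The encoding idea described above can be illustrated with a minimal sketch. It is not the pySAR implementation and does not use the pySAR API. It assumes a single index (the Kyte-Doolittle hydropathy scale), a power spectrum computed with a naive discrete Fourier transform after mean removal, and amino acid composition as a stand-in for the protein descriptors. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
import cmath

# Kyte-Doolittle hydropathy values, used here as one example amino acid index.
# Assumption: any other numeric index could be substituted.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(HYDROPATHY)


def index_encode(seq, index=HYDROPATHY):
    """Map each residue to its numeric index value."""
    return [index[aa] for aa in seq]


def power_spectrum(signal):
    """Power spectrum of a real signal via a naive DFT.

    Only the non-redundant half (frequencies 0..n//2) is returned.
    The mean is subtracted first, which is an assumption of this sketch.
    """
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def composition(seq):
    """Amino acid composition: fraction of each of the 20 residues."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def encode(seq):
    """Concatenate the protein spectrum with the descriptor vector."""
    return power_spectrum(index_encode(seq)) + composition(seq)


# Hypothetical equal-length toy sequences, not taken from the paper's dataset.
toy_library = ["MKTAYIAKQR", "MKTAYIAKQL", "MKSAYLAKQR"]
features = [encode(s) for s in toy_library]
```

Each row of `features` is one encoded mutant. In a workflow like the one described, such rows would form the input matrix for a regression model, for example scikit-learn's `PLSRegression`, fitted against measured activity values. That step is omitted here because it needs real activity data.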
