Article

Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors

JOURNAL OF BIOMEDICAL INFORMATICS (2022)

Journal

JOURNAL OF BIOMEDICAL INFORMATICS

Volume 128, Issue -, Pages -

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE

DOI: 10.1016/j.jbi.2022.104016

Keywords

Digital Signal Processing; Directed Evolution; Machine Learning; Protein Spectra; Physiochemical Descriptors

Categories

Computer Science, Interdisciplinary Applications; Medical Informatics

Funding

Google's Google Cloud Academic Research Grant under Google Cloud COVID-19

Summary

Accurately predicting the effects of protein mutations is a key focus in protein engineering. This study evaluates encoding strategies for protein sequences using the Amino Acid Index database. By transforming the indices into their spectral form and combining them with protein structural and physiochemical descriptors, as well as using the Partial Least Squares Regression algorithm, predictive models with improved quality were built. The findings highlight the utility of this encoding strategy in identifying the Sequence-Activity-Relationship (SAR).

Abstract

Accurately establishing the connection between a protein sequence and its function remains a focal point of protein engineering, especially for predicting the effects of mutations. This has motivated a continued drive to build accurate and reliable machine learning models that allow the virtual screening of many protein mutant sequences by measuring the relationship between sequence and 'fitness' or 'activity', commonly known as a Sequence-Activity-Relationship (SAR). An important preliminary stage in building these predictive models is the encoding of the chosen sequences. This work evaluates a range of encoding strategies using the Amino Acid Index database, where the indices are transformed into their spectral form via Digital Signal Processing (DSP) techniques, as well as numerous protein structural and physiochemical descriptors. The encoding strategies are explored on a dataset curated to measure the thermostability of various mutants from a recombination library designed from parental cytochrome P450s. The work concludes that protein spectra concatenated with protein descriptors, together with the Partial Least Squares Regression (PLS) algorithm, gave the most noteworthy increase in the quality of the predictive models (as described in Encoding Strategy C), highlighting their utility in identifying an SAR. The accompanying software produced for this paper is termed pySAR (Python Sequence-Activity Relationship), which allows a user to find the optimal arrangement of structural and/or physiochemical properties to encode their specific mutant library dataset; the source code is available at: https://github.com/amckenna41/pySAR.
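The encoding idea described in the abstract (map each residue to an Amino Acid Index value, transform the resulting numeric signal into a spectrum, then concatenate it with a descriptor) can be illustrated with a minimal sketch. This is not pySAR's API: the function names, the Hamming window, the choice of the Kyte-Doolittle hydropathy index (AAindex accession KYTJ820101), the use of amino acid composition as the descriptor, and the toy sequences are all assumptions made for illustration.

```python
import numpy as np

# Kyte-Doolittle hydropathy values (AAindex accession KYTJ820101).
# Using this particular index is an illustrative assumption.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_index(seq, index=KYTE_DOOLITTLE):
    """Map each residue to its index value, giving a numeric signal."""
    return np.array([index[aa] for aa in seq], dtype=float)


def power_spectrum(signal):
    """Power spectrum of the mean-centred, Hamming-windowed signal."""
    centred = signal - signal.mean()
    windowed = centred * np.hamming(len(signal))
    return np.abs(np.fft.rfft(windowed)) ** 2


def aa_composition(seq):
    """Simple descriptor: fraction of each of the 20 standard residues."""
    return np.array([seq.count(aa) / len(seq) for aa in AMINO_ACIDS])


def encode(seq):
    """Concatenate the protein spectrum with the descriptor vector."""
    return np.concatenate([power_spectrum(encode_index(seq)), aa_composition(seq)])


# Hypothetical toy mutants; equal length is assumed so the rows stack.
sequences = ["MKTAYIAKQR", "MKTAYIAKQL", "MKTVYIAKQR"]
X = np.vstack([encode(s) for s in sequences])
print(X.shape)
```

Each row of `X` is one encoded mutant. With measured activity values for each sequence, a matrix like this could be passed to a regression model such as scikit-learn's `PLSRegression`; the paper's own pipeline and parameter choices may differ from this sketch.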
