Article

Generalized Property-Based Encoders and Digital Signal Processing Facilitate Predictive Tasks in Protein Engineering

Publisher

FRONTIERS MEDIA SA
DOI: 10.3389/fmolb.2022.898627

Keywords

protein engineering; predictive models; machine learning; digital signal processing; Fourier transform; numerical representation strategies

This study proposes a method to improve the performance of predictive models in protein engineering by generalizing encoding strategies based on physicochemical properties. The authors partition the AAIndex database into semantically consistent groups of properties and apply a non-linear PCA within each group to define one encoder per group. Models using these encoders outperform classical approaches in most cases for protein and peptide function prediction, folding prediction, and biological activity prediction. The study also proposes a preliminary methodology to create new sequences with desired properties. Overall, it offers simple ways to enhance predictive tasks in protein engineering without increasing their complexity.
Computational methods in protein engineering often require encoding amino acid sequences, i.e., converting them into numeric arrays. Physicochemical properties are a typical choice to define encoders, where each amino acid is replaced by its value for a given property. However, which property (or group of properties) is best for a given predictive task remains an open problem. In this work, we generalize property-based encoding strategies to maximize the performance of predictive models in protein engineering. First, combining text mining and unsupervised learning, we partitioned the AAIndex database into eight semantically consistent groups of properties. We then applied a non-linear PCA within each group to define a single encoder to represent it. Next, in several case studies, we assessed the performance of predictive models for protein and peptide function, folding, and biological activity, trained using the proposed encoders and classical methods (One Hot Encoder and TAPE embeddings). Models trained on datasets encoded with our encoders and converted to signals through the Fast Fourier Transform (FFT) increased their precision and reduced their overfitting substantially, outperforming classical approaches in most cases. Finally, we propose a preliminary methodology to create de novo sequences with desired properties. Together, these results offer simple ways to increase the performance of general and complex predictive tasks in protein engineering without increasing their complexity.
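
The encoding pipeline described in the abstract (replace each residue by a property value, then convert the numeric sequence to a frequency-domain signal) can be illustrated with a minimal sketch. Several things here are assumptions and not taken from the paper: the Kyte-Doolittle hydropathy scale stands in for the authors' PCA-derived encoders, the function names and the padding length of 16 are hypothetical, and a direct discrete Fourier transform is used in place of an FFT routine (it yields the same coefficients, only more slowly).

```python
import cmath

# Kyte-Doolittle hydropathy index, used only as a stand-in property.
# Assumption: the paper's encoders come from groups of AAIndex properties,
# not from this single scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=HYDROPATHY):
    """Replace each residue by its property value (KeyError on unknown residues)."""
    return [scale[aa] for aa in sequence.upper()]


def zero_pad(values, length):
    """Pad with zeros so every sequence yields a vector of the same size."""
    if len(values) > length:
        raise ValueError("sequence is longer than the padding length")
    return values + [0.0] * (length - len(values))


def dft_magnitudes(values):
    """Direct O(n^2) DFT; returns magnitudes of the first n // 2 frequency bins."""
    n = len(values)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(values)
        )
        spectrum.append(abs(coeff))
    return spectrum


def sequence_to_spectrum(sequence, length=16):
    """Hypothetical helper: property encoding, zero padding, then DFT magnitudes."""
    return dft_magnitudes(zero_pad(encode(sequence), length))


if __name__ == "__main__":
    features = sequence_to_spectrum("ACDEFGHIK", length=16)
    print(len(features), features[:3])
```

The resulting fixed-length magnitude vector could then serve as the input features for any regressor or classifier. Only the first half of the spectrum is kept because the transform of a real-valued signal is symmetric.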
