4.7 Article

Alignment-free estimation of sequence conservation for identifying functional sites using protein sequence embeddings

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 24, Issue 1, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbac599

Keywords

protein language models; sequence conservation; functional site prediction; deep learning


Protein language modeling is a new deep learning method in bioinformatics with various applications. This study presents a method for estimating sequence conservation using sequence embeddings generated from protein language models. Among the models benchmarked, the ESM2 models show the best ratio of performance to computational cost for conservation estimation. The method can identify conserved functional sites in any full-length protein sequence and estimate conservation without the need for sequence alignment.
Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, its application to estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best ratio of performance to computational cost for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
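
To make the idea of embedding-based conservation scoring concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each residue already has an embedding vector from a protein language model and that a per-residue linear readout maps that vector to a conservation score; the abstract does not specify the regression model, so the linear form, the function names `conservation_scores` and `top_conserved`, and the random stand-in embeddings and weights are all hypothetical. Because every residue is scored from its own embedding, no alignment is required and a multidomain sequence can be scored in one pass.

```python
import numpy as np


def conservation_scores(embeddings, weights, bias=0.0):
    """Map per-residue embeddings (L x D) to one score per residue in [0, 1].

    Hypothetical linear readout: score_i = clip(e_i . w + b, 0, 1).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if embeddings.ndim != 2 or embeddings.shape[1] != weights.shape[0]:
        raise ValueError("expected embeddings (L, D) and weights (D,)")
    raw = embeddings @ weights + bias
    return np.clip(raw, 0.0, 1.0)


def top_conserved(scores, k=5):
    """Return 1-based positions of the k highest-scoring residues, best first."""
    order = np.argsort(scores)[::-1][:k]
    return (order + 1).tolist()


# Stand-in data: random vectors replace real language-model embeddings,
# and random weights replace a trained regression model (both assumptions).
rng = np.random.default_rng(0)
sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
emb = rng.normal(size=(len(sequence), 8))
w = rng.normal(size=8) * 0.1

scores = conservation_scores(emb, w, bias=0.5)
hits = top_conserved(scores, k=3)
print([(pos, sequence[pos - 1]) for pos in hits])
```

In practice the embeddings would come from a model such as ESM2 and the weights from a model fitted to alignment-derived conservation scores; the scripts in the linked repository are the reference for how the authors actually do this.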
