Article

OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches

Journal

BIOINFORMATICS
Volume 37, Issue 18, Pages 2866-2873

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btab219

Funding

  1. Swiss National Science Foundation [167276, 183723]

Assigning new sequences to known protein families and subfamilies is crucial for many functional, comparative and evolutionary genomics analyses. However, relying solely on the closest sequence in a reference database for assignment can lead to misassignments, as a query sequence may not necessarily belong to the same subfamily as its closest sequence. To overcome this issue, a novel alignment-free protein subfamily assignment method called OMAmer has been introduced, which provides better and quicker subfamily-level assignments compared to methods relying on the closest sequence.
Motivation: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin that branched off before the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, although it is neither. To overcome this problem, phylogeny-driven tools have emerged, but they rely on gene trees, whose inference is computationally expensive.

Results: Here, we first show that in multiple animal and plant datasets, 18-62% of closest-sequence assignments are incorrect, typically pointing to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, whilst able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
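
The over-specific assignment problem described above can be illustrated with a small toy sketch. This is not the OMAmer algorithm or its API: the sequences, the two-subfamily tree, the k-mer length, the Jaccard similarity used as a stand-in for a closest-sequence search, and the `min_specific` threshold are all assumptions made up for illustration. The sketch only shows the general idea of attaching each k-mer to the most specific tree node that covers all its occurrences, and of descending into a subfamily only when the query carries enough subfamily-specific evidence.

```python
from collections import Counter, defaultdict

K = 3  # toy k-mer length, chosen only to suit the short example strings

# Hypothetical toy family: one root family with two subfamilies.
PARENT = {"globin": None, "alpha": "globin", "beta": "globin"}

# Made-up reference sequences, each labelled with its subfamily.
REFERENCES = {
    "alpha_1": ("alpha", "MKLVAAGHTWQ"),
    "alpha_2": ("alpha", "MKLVAAGHSWQ"),
    "beta_1": ("beta", "MKLVDDGHTRP"),
    "beta_2": ("beta", "MKLVDDGHSRP"),
}


def kmers(seq, k=K):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def lineage(node):
    """Return node followed by all of its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def closest_sequence_assignment(query):
    """Assign the query to the subfamily of its most similar reference
    (Jaccard similarity of k-mer sets stands in for an alignment score)."""
    q = kmers(query)

    def jaccard(name):
        r = kmers(REFERENCES[name][1])
        return len(q & r) / len(q | r)

    return REFERENCES[max(REFERENCES, key=jaccard)][0]


def build_index():
    """Map each k-mer to the most specific node covering all its occurrences."""
    seen = defaultdict(set)
    for subfamily, seq in REFERENCES.values():
        for km in kmers(seq):
            seen[km].add(subfamily)
    index = {}
    for km, subfamilies in seen.items():
        paths = [lineage(s) for s in subfamilies]
        # Lowest common ancestor: first node on one path shared by all paths.
        index[km] = next(n for n in paths[0] if all(n in p for p in paths))
    return index


def tree_aware_assignment(query, index, min_specific=2):
    """Descend from the root only while exactly one child has enough
    k-mers that are specific to its subtree; otherwise stay at the parent."""
    counts = Counter(index[km] for km in kmers(query) if km in index)
    if not counts:
        return None  # no shared k-mers: reject the family-level match
    node = next(n for n, parent in PARENT.items() if parent is None)
    while True:
        children = [c for c, parent in PARENT.items() if parent == node]
        supported = [
            c for c in children
            if sum(v for n, v in counts.items() if c in lineage(n)) >= min_specific
        ]
        if len(supported) != 1:
            return node  # no child, or more than one, is supported
        node = supported[0]
```

With these made-up sequences, the query `"MKLVEEGHTWN"` shares one extra k-mer with `alpha_1` and is therefore assigned to `alpha` by `closest_sequence_assignment`, whereas `tree_aware_assignment` keeps it at the root `globin` because a single alpha-specific k-mer falls below the threshold. A query with no indexed k-mers returns `None`, mirroring the rejection of non-homologous sequences at the family level.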
