4.6 Article

EnsembleFam: towards more accurate protein family prediction in the twilight zone

Journal

BMC BIOINFORMATICS
Volume 23, Issue 1

Publisher

BMC
DOI: 10.1186/s12859-022-04626-w

Keywords

Protein function prediction; Twilight zone sequence; Sequence homology; Support vector machine; Ensemble classifier

Funding

  1. National Research Foundation, Prime Minister's Office, Singapore under its Synthetic Biology Research and Development Programme [SBP-P3]
  2. Kwan Im Thong Hood Cho Temple chair professorship

This study presents a novel method called EnsembleFam that aims to improve function prediction for proteins in the twilight zone. By extracting core characteristics and using SVM classifiers, EnsembleFam achieves better accuracy in identifying proteins with low sequence homology compared to existing methods.

Background: Current protein family modeling methods, such as profile Hidden Markov Models (pHMM), k-mer based methods, and deep learning-based methods, do not provide very accurate function prediction for proteins in the twilight zone, because these proteins have low sequence similarity to reference proteins with known functions.

Results: We present a novel method, EnsembleFam, aiming at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family using these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides a much more accurate prediction for twilight zone proteins.

Conclusions: EnsembleFam, a machine learning method to model protein families, can be used to better identify members with very low sequence homology. Using EnsembleFam, protein functions can be predicted from sequence information alone with better accuracy than state-of-the-art methods.
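The abstract describes three classifiers per family whose outputs are combined into an ensemble prediction, but it does not say how they are combined. The sketch below illustrates one plausible reading: a majority vote over three binary classifiers per family. Everything named here is hypothetical. `threshold_classifier` is a simple stand-in for a trained SVM, the two-element feature tuple (similarity, dissimilarity) is a simplification of the features the paper describes, and the `min_votes=2` rule is an assumption rather than the authors' stated method.

```python
from typing import Callable, Dict, List, Sequence

# A binary classifier maps a feature vector to True (member) or False (non-member).
BinaryClassifier = Callable[[Sequence[float]], bool]


def threshold_classifier(margin: float) -> BinaryClassifier:
    """Hypothetical stand-in for a trained SVM: accept a protein when its
    similarity score exceeds its dissimilarity score by more than `margin`."""
    def classify(features: Sequence[float]) -> bool:
        similarity, dissimilarity = features
        return similarity - dissimilarity > margin
    return classify


def ensemble_predict(
    family_models: Dict[str, Sequence[BinaryClassifier]],
    features_by_family: Dict[str, Sequence[float]],
    min_votes: int = 2,  # assumed majority rule for three classifiers
) -> List[str]:
    """Return the families for which at least `min_votes` classifiers accept."""
    predicted = []
    for family, classifiers in family_models.items():
        features = features_by_family[family]
        votes = sum(1 for clf in classifiers if clf(features))
        if votes >= min_votes:
            predicted.append(family)
    return predicted


# Two made-up families, each with three classifiers of differing strictness.
models = {
    "FAM_A": [threshold_classifier(m) for m in (0.1, 0.3, 0.5)],
    "FAM_B": [threshold_classifier(m) for m in (0.1, 0.3, 0.5)],
}
# Made-up (similarity, dissimilarity) features of one query protein per family.
query_features = {"FAM_A": (0.8, 0.4), "FAM_B": (0.3, 0.6)}

print(ensemble_predict(models, query_features))
```

With these made-up numbers, two of the three `FAM_A` classifiers accept the query and none of the `FAM_B` classifiers do, so the call prints `['FAM_A']`. Raising `min_votes` to 3 would require unanimous agreement and return an empty list here.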
