Article

Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier

Journal

GENOME RESEARCH
Volume 11, Issue 8, Pages 1404-1409

Publisher

COLD SPRING HARBOR LAB PRESS
DOI: 10.1101/gr.186401

Bacterial genomes have diverged during evolution, resulting in clear-cut differences in their nucleotide composition, such as their GC content. The analysis of complete sequences of bacterial genomes also reveals the presence of nonrandom sequence variation, manifest in the frequency profile of specific short oligonucleotides. These frequency profiles constitute highly specific genomic signatures. Based on these differences in oligonucleotide frequency between bacterial genomes, we investigated the possibility of predicting the genome of origin for a specific genomic sequence. To this end, we developed a naive Bayesian classifier and systematically analyzed 28 eubacterial and archaeal genomes. We found that sequences as short as 400 bases could be correctly classified with an accuracy of 85%. We then applied the classifier to the identification of horizontal gene transfer events in whole-genome sequences and demonstrated the validity of our approach by correctly predicting the transfer of both the superoxide dismutase (sodC) and the bioC gene from Haemophilus influenzae to Neisseria meningitidis, correctly identifying both the donor and recipient species. We believe that this classification methodology could be a valuable tool in biodiversity studies.
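
The classification idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the motif length `K = 3`, the add-one smoothing, the uniform prior over genomes, and the function names `kmer_counts`, `train`, and `classify` are all assumptions made for illustration, and the two "genomes" are synthetic toy strings rather than real sequence data. Each genome is summarized by the frequency of its overlapping oligonucleotides, and a fragment is assigned to the genome under which its oligonucleotides are most probable.

```python
import math
from collections import Counter
from itertools import product

K = 3  # assumed motif length; the abstract does not state the one used
ALPHABET = "ACGT"


def kmer_counts(seq, k=K):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    return counts


def train(genomes, k=K):
    """Estimate log P(k-mer | genome) for every possible k-mer.

    Add-one smoothing (an assumption) keeps unseen k-mers from
    producing log(0).
    """
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    model = {}
    for name, seq in genomes.items():
        counts = kmer_counts(seq, k)
        total = sum(counts.values()) + len(all_kmers)
        model[name] = {m: math.log((counts[m] + 1) / total) for m in all_kmers}
    return model


def classify(model, fragment, k=K):
    """Return the best-scoring genome and all scores.

    The score is the sum of k-mer log-probabilities, which treats k-mers
    as independent (the "naive" assumption) under a uniform prior.
    """
    counts = kmer_counts(fragment, k)
    scores = {
        name: sum(n * logp[m] for m, n in counts.items())
        for name, logp in model.items()
    }
    return max(scores, key=scores.get), scores


# Synthetic toy "genomes" (hypothetical, for illustration only).
toy_genomes = {
    "gc_rich": "GCGCCGGC" * 200,
    "at_rich": "ATTATAAT" * 200,
}
toy_model = train(toy_genomes)
print(classify(toy_model, "GGCGCCGGCGC")[0])  # prints gc_rich
```

Working in log space avoids numerical underflow when a fragment of several hundred bases contributes hundreds of small probabilities. With real genomes, the same `classify` call could be applied to a sliding window along a sequence; windows that score best against a genome other than their host would then be candidates for horizontal transfer, in the spirit of the sodC and bioC example above.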
