Article, Proceedings Paper

A fast adaptive algorithm for computing whole-genome homology maps

BIOINFORMATICS (2018)

Journal

BIOINFORMATICS

Volume 34, Issue 17, Pages 748-756

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/bioinformatics/bty597

Funding

Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health
U.S. National Science Foundation [CCF-1816027]

Abstract

Motivation: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute- and memory-intensive. In addition, current practical methods lack any guarantee on the characteristics of the output alignments, which makes them hard to tune for different application requirements.

Results: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using k-mer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher-scoring alignment intervals, we develop a plane-sweep-based filtering technique that is theoretically optimal and practically efficient. Implementing these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. With it, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, a significant improvement in memory usage over competing methods. Recall accuracy of the computed alignment boundaries was consistently found to be > 97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length >= 1 Kbp and >= 90% identity. The reported output achieves good recall and covers twice as many bases as the current UCSC browser's segmental duplication annotation.
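
To give a feel for what an identity estimate from k-mer-based statistics can look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give a formula, so the sketch swaps in the widely used Mash distance, which converts the Jaccard similarity j of two k-mer sets into an approximate identity 1 + ln(2j / (1 + j)) / k. The function names and the default k = 16 are hypothetical, and exact k-mer sets are used here where a practical mapper would rely on sampling or sketching.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity of the k-mer sets of sequences a and b."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return len(ka & kb) / len(union)


def estimate_identity(a, b, k=16):
    """Approximate sequence identity from k-mer Jaccard similarity.

    Assumption: uses the Mash distance D = -(1/k) * ln(2j / (1 + j))
    and reports identity as 1 - D, clamped at 0. This formula is not
    stated in the abstract; it is a stand-in for illustration.
    """
    j = jaccard(a, b, k)
    if j == 0.0:
        return 0.0  # no shared k-mers: the estimate is undefined, report 0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)
```

Identical sequences share every k-mer (j = 1) and therefore get an estimate of exactly 1.0, while sequences with no shared k-mers get 0.0. A candidate region would then be kept only if its estimate meets the chosen identity threshold, for example 0.90 for the 90% cutoff mentioned above.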
