
Efficient mapping of accurate long reads in minimizer space with mapquik

GENOME RESEARCH (2023)

Journal

GENOME RESEARCH

Volume 33, Issue 7, Pages 1188-1197

Publisher

COLD SPRING HARBOR LAB PRESS, PUBLICATIONS DEPT

DOI: 10.1101/gr.277679.123

Categories

Biochemistry & Molecular Biology; Biotechnology & Applied Microbiology; Genetics & Heredity

Summary

DNA sequencing data are improving, with longer reads and lower error rates. This paper introduces mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of consecutively sampled minimizers. Mapquik significantly accelerates the seeding and chaining steps of read mapping, giving ultrafast mapping while retaining high sensitivity. The results show that mapquik is substantially faster than the state-of-the-art tool minimap2 (37x on the human genome and 410x on the maize genome) with >96% sensitivity and near-perfect specificity.

Abstract

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37x speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a 410x speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudochaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
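
The seeding idea in the abstract (k consecutively sampled minimizers, indexed only when they occur once in the reference) can be illustrated with a small Python sketch. This is not mapquik's implementation. The function names, the parameter values (l, density, k), the CRC32-based hash, and the hash-threshold selection rule are all assumptions chosen for illustration, and reverse complements, sequencing errors, and chaining are ignored.

```python
import zlib
from collections import Counter


def select_minimizers(seq, l=8, density=0.2):
    """Return (position, l-mer) pairs whose hash falls below a threshold.

    Assumption: a context-free rule (hash < density * 2^32), so an l-mer is
    selected or not regardless of its neighbours.
    """
    threshold = int(density * 2**32)
    picked = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if zlib.crc32(lmer.encode()) < threshold:
            picked.append((i, lmer))
    return picked


def k_min_mers(minimizers, k=3):
    """Yield (start_position, key), where key is a tuple of k consecutive minimizers."""
    for i in range(len(minimizers) - k + 1):
        window = minimizers[i:i + k]
        yield window[0][0], tuple(lmer for _, lmer in window)


def unique_index(kmms):
    """Map each k-min-mer that occurs exactly once to its start position."""
    kmms = list(kmms)
    counts = Counter(key for _, key in kmms)
    return {key: pos for pos, key in kmms if counts[key] == 1}


def seed_hits(read, index, l=8, density=0.2, k=3):
    """Return (read_position, reference_position) pairs for k-min-mers found in the index.

    The l, density, and k values must match those used to build the index.
    """
    hits = []
    for read_pos, key in k_min_mers(select_minimizers(read, l, density), k):
        ref_pos = index.get(key)
        if ref_pos is not None:
            hits.append((read_pos, ref_pos))
    return hits
```

A reference would be indexed once with unique_index(k_min_mers(select_minimizers(reference), k=3)), and each read then queried with seed_hits(read, index). For an error-free read, every hit implies the same offset between read and reference coordinates, which is what makes unique k-min-mer matches usable as alignment anchors.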
