☆ 4.5 Article

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

JOURNAL OF COMPUTATIONAL BIOLOGY (2020)

期刊

JOURNAL OF COMPUTATIONAL BIOLOGY

卷 27, 期 4, 页码 500-513

出版社

MARY ANN LIEBERT, INC

DOI: 10.1089/cmb.2019.0309

关键词

Burrows-Wheeler Transform; indexing; r-index; pan-genomics

类别

Biochemical Research Methods Biotechnology & Applied Microbiology Computer Science, Interdisciplinary Applications Mathematical & Computational Biology Statistics & Probability

资金

National Science Foundation [IIS1618814, IIS-1349906]
National Institutes of Health/National Institute of Allergy and Infectious Diseases (NIAID) [R01AI141810-01]
National Institutes of Health/National Institute of General Medical Sciences [R01GM118568]
Fondecyt [1171058]
MIUR-PRIN [2017WR7SHH]
INdAM-GNCS project Innovative methods for the solution of medical and biological big data

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

摘要

Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that-when used with the rank data structure-allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

期刊

JOURNAL OF COMPUTATIONAL BIOLOGY

出版社

MARY ANN LIEBERT, INC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

期刊

JOURNAL OF COMPUTATIONAL BIOLOGY

出版社

MARY ANN LIEBERT, INC

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文