☆ 4.7 Article

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

ACM COMPUTING SURVEYS (2021)

期刊

ACM COMPUTING SURVEYS

卷 54, 期 2, 页码 -

出版社

ASSOC COMPUTING MACHINERY

DOI: 10.1145/3432999

关键词

Text indexing; string searching; compressed data structures; repetitive string collections

类别

Computer Science, Theory & Methods

资金

ANID Basal Funds [FB0001]
Millennium Science Initiative Program [ICN17_002]
Fondecyt, Chile [1-200038]

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

The text discusses a breakthrough in indexing string collections, which allows representing them in a compressed space while offering indexed search functionalities. With the technology permeating through applications like bioinformatics, string collections have experienced rapid growth, particularly due to their high repetitiveness. New techniques have been developed to properly exploit this repetitiveness, leading to the creation of a new generation of data structures capable of handling huge repetitive string collections.

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this second part, we describe the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations. We conclude with the current challenges in this fascinating field.

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

期刊

ACM COMPUTING SURVEYS

出版社

ASSOC COMPUTING MACHINERY

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

期刊

ACM COMPUTING SURVEYS

出版社

ASSOC COMPUTING MACHINERY

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文