期刊
ACM COMPUTING SURVEYS
卷 54, 期 2, 页码 -出版社
ASSOC COMPUTING MACHINERY
DOI: 10.1145/3432999
关键词
Text indexing; string searching; compressed data structures; repetitive string collections
资金
- ANID Basal Funds [FB0001]
- Millennium Science Initiative Program [ICN17_002]
- Fondecyt, Chile [1-200038]
The text discusses a breakthrough in indexing string collections, which allows representing them in a compressed space while offering indexed search functionalities. With the technology permeating through applications like bioinformatics, string collections have experienced rapid growth, particularly due to their high repetitiveness. New techniques have been developed to properly exploit this repetitiveness, leading to the creation of a new generation of data structures capable of handling huge repetitive string collections.
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this second part, we describe the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations. We conclude with the current challenges in this fascinating field.
作者
我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。
推荐
暂无数据