4.5 Article

Storage and Retrieval of Highly Repetitive Sequence Collections

期刊

JOURNAL OF COMPUTATIONAL BIOLOGY
卷 17, 期 3, 页码 281-308

出版社

MARY ANN LIEBERT, INC
DOI: 10.1089/cmb.2009.0169

关键词

comparative genomics; compressed data structures; full-text indexing; suffix tree

资金

  1. Academy of Finland [119815]
  2. Millennium Institute for Cell Dynamics and Biotechnology (ICDB) [ICM P05-001-F]
  3. Research Foundation of the University of Helsinki
  4. Helsinki Graduate School in Computer Science and Engineering
  5. Academy of Finland (AKA) [119815, 119815] Funding Source: Academy of Finland (AKA)

向作者/读者索取更多资源

A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are able of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

4.5
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据