Article

Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees

Journal

JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 25, Issue 7, Pages 755-765

Publisher

MARY ANN LIEBERT, INC
DOI: 10.1089/cmb.2017.0265

Keywords

data indexing; RNA-seq; sequence bloom trees; sequence search

Funding

  1. Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative [GBMF4554]
  2. U.S. National Science Foundation [CCF-1256087, CCF-1319998]
  3. U.S. National Institutes of Health [R21HG006913, R01HG007104]
  4. U.S. National Institutes of Health training grant as part of the Howard Hughes Medical Institute (HHMI)-National Institute of Biomedical Imaging and Bioengineering (NIBIB) Interfaces Initiative [T32 EB009403]

Enormous databases of short-read RNA-seq experiments, such as the NIH Sequence Read Archive, are now available. These databases could answer many questions about condition-specific expression or population variation, and they will only continue to grow. However, these collections remain difficult to use because they cannot be searched for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called split sequence bloom trees (SSBTs) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the sequence bloom tree (SBT) data structure for the same task. We apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in under 4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.
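
The abstract does not describe the internals of either index, so the following is only a minimal sketch of the general idea behind a Bloom-filter tree for sequence search, as it is commonly described for the SBT: each experiment is summarized by a Bloom filter over its k-mers, each internal node stores the union of its children's filters, and a query descends only into subtrees that contain at least a fraction theta of the query's k-mers. It does not implement the split filters that distinguish SSBT. Every name and parameter here (BloomFilter, build_tree, query, k, theta, the filter size and hash count) is an assumption made for illustration and is not taken from the paper or its software.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter stored as an integer bitmask (illustrative only)."""

    def __init__(self, num_bits=4096, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        # Bitwise OR: the result reports every item either filter reports.
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom = bloom
        self.name = name  # set only on leaves (one leaf per experiment)
        self.children = list(children)


def build_tree(experiments, k):
    """Build a tree from a non-empty dict of experiment name -> list of reads."""
    nodes = []
    for name, reads in experiments.items():
        bloom = BloomFilter()
        for read in reads:
            for kmer in kmers(read, k):
                bloom.add(kmer)
        nodes.append(Node(bloom, name=name))
    # Pair nodes level by level; each parent holds the union of its children.
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            group = nodes[i:i + 2]
            if len(group) == 1:
                paired.append(group[0])
            else:
                merged = group[0].bloom.union(group[1].bloom)
                paired.append(Node(merged, children=group))
        nodes = paired
    return nodes[0]


def query(node, query_kmers, theta):
    """Return names of experiments holding >= theta of the query k-mers."""
    found = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if found < theta * len(query_kmers):
        return []  # prune: no leaf below this node can reach the threshold
    if not node.children:
        return [node.name]
    hits = []
    for child in node.children:
        hits.extend(query(child, query_kmers, theta))
    return hits
```

With this sketch, a search would look like query(build_tree(experiments, k=8), kmers(transcript, 8), theta=0.8). Because Bloom filters can report false positives but never false negatives, pruning at an internal node cannot discard a true hit, while a reported hit may occasionally be spurious.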
