Journal
BIOINFORMATICS
Volume 30, Issue 24, Pages 3524-3531Publisher
OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btu584
Keywords
-
Categories
Funding
- Jackson Lab Center for Genome Dynamics: Evolution, Organization and Function [NIH P50GM076468]
- University of North Carolina Center for CISGen: Systems Genetics of Psychiatric Disorders [NIH P50HG006582, P50MH090338]
Ask authors/readers for more resources
Motivation: The throughput of genomic sequencing has increased to the point that is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis. Results: We present a novel algorithm that merges multi-string BWTs in O(LCS x N) time where LCS is the length of their longest common substring between any of the inputs, and N is the total length of all inputs combined (number of symbols) using O(N x log(2)(F)) bits where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to have a higher compressibility compared with the input multi-string BWTs separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.
Authors
I am an author on this paper
Click your name to claim this paper and add it to your profile.
Reviews
Recommended
No Data Available