
QualComp: a new lossy compressor for quality scores based on rate distortion theory

Journal

BMC BIOINFORMATICS
Volume 14

Publisher

BMC
DOI: 10.1186/1471-2105-14-187

Keywords

Next generation sequencing; Quality scores; Compression; FASTQ format; Rate distortion; Mean squared error

Funding

  1. Scott A. and Geraldine D. Macomber Stanford Graduate Fellowship
  2. Thomas and Sarah Kailath Fellowship in Science and Engineering
  3. 3Com Corporation Stanford Graduate Fellowship
  4. La Caixa Fellowship
  5. EMBO long term Fellowship
  6. Center for Science of Information (CSoI)
  7. Hewlett Packard Labs


Background: Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data.

Results: In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment).

Conclusions: QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp. It is designed to lossily compress the quality scores presented in a FASTQ file. Given a model for the quality scores, we use rate-distortion results to optimally allocate the available bits in order to minimize the MSE. This metric allows us to compare different lossy compression algorithms for quality scores without depending on downstream applications that may use the quality scores in very different ways.
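
To make the idea of allocating bits to minimize the MSE more concrete, the following is a minimal sketch of the classical reverse water-filling allocation from rate-distortion theory. It assumes the quality scores have been modeled as independent Gaussian components with known variances; that model, the function name `reverse_water_filling`, and its parameters are illustrative assumptions and are not taken from the QualComp source code.

```python
import math

def reverse_water_filling(variances, total_rate_bits, iters=100):
    """Split total_rate_bits across independent Gaussian components.

    Hypothetical helper: returns (rates, distortions), where component i
    receives rates[i] bits and incurs distortions[i] mean squared error.
    """
    if total_rate_bits < 0:
        raise ValueError("total_rate_bits must be non-negative")
    # With no rate (or no variance) nothing is encoded: distortion = variance.
    if total_rate_bits == 0 or max(variances, default=0.0) <= 0:
        return [0.0] * len(variances), list(variances)

    def rate_at(theta):
        # Total rate needed when the "water level" (per-component MSE cap) is theta.
        return sum(0.5 * math.log2(v / theta) for v in variances if v > theta)

    # The required rate decreases as theta grows, so bisect on theta.
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if rate_at(mid) > total_rate_bits:
            lo = mid  # level too low: it would cost more bits than available
        else:
            hi = mid
    theta = hi

    # Components with variance below the water level get zero bits.
    distortions = [min(theta, v) for v in variances]
    rates = [0.5 * math.log2(v / d) if v > d else 0.0
             for v, d in zip(variances, distortions)]
    return rates, distortions

rates, distortions = reverse_water_filling([4.0, 1.0, 0.25], total_rate_bits=2.0)
print(rates, sum(distortions))
```

For the variances 4, 1 and 0.25 with a budget of 2 bits, the water level settles at 0.5: the first two components receive about 1.5 and 0.5 bits, the third receives none, and the total MSE is about 1.25. Because the budget is an input rather than a consequence of the data, a scheme of this kind can be run at any user-chosen rate.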
