Article

Data Set-Adaptive Minimizer Order Reduces Memory Usage in k-Mer Counting

Journal

JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 29, Issue 8, Pages 825-838

Publisher

MARY ANN LIEBERT, INC
DOI: 10.1089/cmb.2021.0599

Keywords

bin mapping; k-mer counting; minimizer order; minimizer scheme; sequencing

The study introduces a method that tailors the minimizer order to the data set in order to reduce memory consumption in k-mer counting. Integrated into Gerbil, a memory-efficient k-mer counter, the method reduced the memory footprint by 30%-50% for large k, with only a minor increase in runtime. Orders produced by the method also performed well when transferred across data sets from the same species, which enables memory reduction with essentially no increase in runtime.
The rapid, continuous growth of deep sequencing experiments requires the development and improvement of many bioinformatic applications for the analysis of large sequencing data sets, including k-mer counting and assembly. Several applications reduce memory usage by binning sequences. Binning is done by using minimizer schemes, which rely on a specific order of the minimizers. It has been demonstrated that the choice of the order has a major impact on the performance of the applications. Here we introduce a method for tailoring the order to the data set. Our method repeatedly samples the data set and modifies the order so as to flatten the k-mer load distribution across minimizers. We integrated our method into Gerbil, a state-of-the-art memory-efficient k-mer counter, and were able to reduce its memory footprint by 30%-50% for large k, with only a minor increase in runtime. Our tests also showed that the orders produced by our method gave superior results when transferred across data sets from the same species, with little or no change to the order. This enables memory reduction with essentially no increase in runtime.
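The idea of sampling the data and adjusting the order to flatten the load can be illustrated with a small Python sketch. This is not the authors' algorithm or Gerbil's code: the function names (`minimizer`, `bin_loads`, `adapt_order`), the lexicographic starting order, and the update rule (demote the single heaviest minimizer in each round) are all assumptions chosen for brevity, and reads are assumed to contain only the letters A, C, G, and T.

```python
from collections import Counter
from itertools import product
import random


def minimizer(kmer, m, rank):
    """Return the m-mer of `kmer` that has the smallest rank under the order."""
    mmers = (kmer[i:i + m] for i in range(len(kmer) - m + 1))
    return min(mmers, key=rank.__getitem__)


def bin_loads(reads, k, m, rank):
    """Count how many k-mers are mapped to each minimizer (one bin per minimizer)."""
    loads = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            loads[minimizer(read[i:i + k], m, rank)] += 1
    return loads


def adapt_order(reads, k, m, rounds=20, sample_size=100, seed=0):
    """Hypothetical heuristic: sample reads, find the heaviest bin, demote it.

    Starts from lexicographic order (an assumption) and, in each round,
    gives the most loaded minimizer the worst rank so that its k-mers
    spread over other minimizers.
    """
    rng = random.Random(seed)
    rank = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=m))}
    next_rank = len(rank)
    for _ in range(rounds):
        sample = rng.sample(reads, min(sample_size, len(reads)))
        loads = bin_loads(sample, k, m, rank)
        if not loads:
            break  # nothing to measure (no reads, or reads shorter than k)
        heaviest = max(loads, key=loads.get)
        rank[heaviest] = next_rank  # push the heaviest minimizer to the end
        next_rank += 1
    return rank


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "AAAAAAAAAA", "ACACACACAC"]
    tuned = adapt_order(reads, k=5, m=2, rounds=3)
    print(bin_loads(reads, 5, 2, tuned).most_common(3))
```

Demoting one minimizer per round keeps the sketch short, but it does not guarantee a flatter distribution on every input; it only shows how a sampled load estimate can drive changes to the order.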
