4.7 Article

Efficient minimizer orders for large values of k using minimum decycling sets

Journal

GENOME RESEARCH
Volume 33, Issue 7, Pages 1154-1161

Publisher

COLD SPRING HARBOR LAB PRESS, PUBLICATIONS DEPT
DOI: 10.1101/gr.277644.123

Keywords

-

Ask authors/readers for more resources

Minimizer schemes, commonly used in high-throughput DNA sequencing data analysis, often select more k-mers than necessary, leading to limited improvement in runtime and memory usage. Universal k-mer hitting sets provide a solution to reduce the number of selected k-mers, but are currently infeasible for large k values. This study introduces decycling-set-based minimizer orders, which improve the efficiency of minimizer orders for large k values by selecting a comparable number of k-mers to universal k-mer hitting sets. Additionally, a method is developed to compute minimizers in real-time without keeping the k-mers in memory, allowing this approach to be used for any value of k. The new orders are expected to enhance the performance of algorithms and data structures in high-throughput DNA sequencing analysis.
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available