4.7 Article

CHTKC: a robust and efficient k-mer counting algorithm based on a lock-free chaining hash table

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 22, Issue 3, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbaa063

Keywords

assembly; DNA-seq; hash table; sequence analysis; k-mer counting; algorithm

Funding

  1. National Natural Science Foundation of China [61771165]
  2. International Postdoctoral Exchange Fellowship [20130053]
  3. China Postdoctoral Science Foundation [2018T110302, 2014M551246]

Ask authors/readers for more resources

The paper proposes a new method called CHTKC to efficiently calculate the frequency of each substring of length k in DNA sequences, using a lock-free hash table and linked lists to resolve collisions and optimize memory usage. Thorough testing on multiple datasets shows that using a hash-table-based method remains a feasible solution for the k-mer counting problem.
Motivation: Calculating the frequency of occurrence of each substring of length k in DNA sequences is a common task in many bioinformatics applications, including genome assembly, error correction, and sequence alignment. Although the problem is simple, efficient counting of datasets with high sequencing depth or large genome size is a challenge. Results: We propose a robust and efficient method, CHTKC, to solve the k-mer counting problem with a lock-free hash table that uses linked lists to resolve collisions. We also design new mechanisms to optimize memory usage and handle situations where memory is not enough to accommodate all k-mers. CHTKC has been thoroughly tested on seven datasets under multiple memory usage scenarios and compared with Jellyfish2 and KMC3. Our work shows that using a hash-table-based method to effectively solve the k-mer counting problem remains a feasible solution.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available