Article

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Journal

JOURNAL OF COMPUTATIONAL BIOLOGY

Publisher

MARY ANN LIEBERT, INC
DOI: 10.1089/cmb.2021.0431

Keywords

confidence intervals; k-mers; MinHash; mutation process; sketching; Jaccard similarity

This study investigates how a simple mutation process affects the k-mers of a sequence such as a genome or a read. We derive the expectation and variance of the number of mutated k-mers, and of the numbers of islands and oceans, and provide hypothesis tests and confidence intervals for the mutation probability based on the observed number of mutated k-mers or on the Jaccard similarity.
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
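The mutation model described above can be explored with a small simulation. The sketch below is illustrative only and is not the authors' code: it assumes a k-mer counts as mutated when at least one of its k nucleotides is mutated, and every function name in it is hypothetical. By linearity of expectation, a k-mer survives with probability (1 - r)^k, so the expected number of mutated k-mers among L k-mers is L(1 - (1 - r)^k). Inverting that expression gives a naive point estimate of r, which is not a substitute for the hypothesis tests and confidence intervals derived in the paper.

```python
import random


def mutated_kmer_flags(num_kmers, k, r, rng):
    """Simulate the mutation process on a sequence with `num_kmers` k-mers.

    Each of the num_kmers + k - 1 nucleotides is mutated independently with
    probability r. Returns a list whose i-th entry is True when the k-mer
    starting at position i covers at least one mutated nucleotide
    (assumption: this is what "mutated k-mer" means under the model).
    """
    n = num_kmers + k - 1
    mutated = [rng.random() < r for _ in range(n)]
    # Prefix sums let us test each window of length k in constant time.
    prefix = [0]
    for m in mutated:
        prefix.append(prefix[-1] + int(m))
    return [prefix[i + k] - prefix[i] > 0 for i in range(num_kmers)]


def count_runs(flags, value):
    """Count maximal runs of `value` in `flags`.

    With value=True this counts islands (maximal intervals of mutated
    k-mers); with value=False it counts oceans (maximal intervals of
    nonmutated k-mers).
    """
    runs = 0
    prev = None
    for f in flags:
        if f == value and f != prev:
            runs += 1
        prev = f
    return runs


def expected_mutated(num_kmers, k, r):
    """Expected number of mutated k-mers: L * (1 - (1 - r)^k)."""
    return num_kmers * (1 - (1 - r) ** k)


def point_estimate_r(n_mut, num_kmers, k):
    """Naive estimate of r obtained by inverting expected_mutated.

    Illustrative only; it carries no measure of uncertainty.
    """
    return 1 - (1 - n_mut / num_kmers) ** (1 / k)


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so a run is reproducible
    L, k, r = 10_000, 21, 0.01  # hypothetical parameter values
    flags = mutated_kmer_flags(L, k, r, rng)
    n_mut = sum(flags)
    print("observed mutated k-mers:", n_mut)
    print("expected mutated k-mers:", expected_mutated(L, k, r))
    print("islands:", count_runs(flags, True))
    print("oceans:", count_runs(flags, False))
    print("naive estimate of r:", point_estimate_r(n_mut, L, k))
```

Because adjacent k-mers overlap in k - 1 nucleotides, the mutated flags are strongly dependent, which is why the variance is less straightforward than the expectation. Repeating the simulation over many seeds gives an empirical sense of that spread.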