Journal
JOURNAL OF COMPUTATIONAL BIOLOGY
Volume -, Issue -, Pages -
Publisher
MARY ANN LIEBERT, INC
DOI: 10.1089/cmb.2021.0431
Keywords
confidence intervals; k-mers; MinHash; mutation process; sketching; Jaccard similarity
In this study, we investigate the impact of a simple mutation process on the k-mers of sequences such as genomes or reads. We derive the expectation and variance of the number of mutated k-mers, and of the numbers of islands and oceans (maximal intervals of mutated and nonmutated k-mers, respectively), and provide hypothesis tests and confidence intervals for the mutation rate based on the observed number of mutated k-mers or the Jaccard similarity.
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider a model in which a sequence S (e.g., a genome or a read) undergoes a simple mutation process: each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (maximal intervals of mutated k-mers) and oceans (maximal intervals of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
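As a rough illustration of the mutation model described above, the following Python sketch simulates it and compares the result with a simple expectation. It is not taken from the paper. It relies only on the stated assumptions (independent per-nucleotide mutation with probability r, no spurious k-mer matches) plus one extra assumption made here: that all k-mers of S are distinct. Under these assumptions a k-mer is mutated whenever any of its k nucleotides is mutated, which happens with probability 1 - (1 - r)^k, so by linearity of expectation a sequence with L k-mers has L(1 - (1 - r)^k) mutated k-mers on average. All function names are hypothetical, and the variance, hypothesis tests, and CIs derived in the paper are not reproduced.

```python
import random


def mutated_kmer_probability(r, k):
    """Probability that a k-mer contains at least one mutated nucleotide."""
    return 1.0 - (1.0 - r) ** k


def expected_mutated_kmers(num_kmers, r, k):
    """Expected number of mutated k-mers, by linearity of expectation."""
    return num_kmers * mutated_kmer_probability(r, k)


def estimate_r(num_mutated, num_kmers, k):
    """Point estimate of r obtained by inverting the expectation.

    This is a simple method-of-moments estimate, not a confidence interval.
    """
    fraction_unmutated = 1.0 - num_mutated / num_kmers
    return 1.0 - fraction_unmutated ** (1.0 / k)


def jaccard_from_mutated(num_mutated, num_kmers):
    """Jaccard similarity of the k-mer sets before and after mutation.

    Assumes distinct k-mers and no spurious matches (assumption made here):
    the sets share num_kmers - num_mutated k-mers, and their union has
    num_kmers + num_mutated k-mers.
    """
    return (num_kmers - num_mutated) / (num_kmers + num_mutated)


def simulate_mutated_kmers(num_kmers, r, k, rng):
    """Simulate one run of the mutation process and count mutated k-mers."""
    num_nucleotides = num_kmers + k - 1
    mutated = [rng.random() < r for _ in range(num_nucleotides)]
    # Sliding window: number of mutated positions inside the current k-mer.
    in_window = sum(mutated[:k])
    total = 1 if in_window > 0 else 0
    for i in range(1, num_kmers):
        in_window += mutated[i + k - 1] - mutated[i - 1]
        if in_window > 0:
            total += 1
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    L, r, k = 1000, 0.01, 21
    trials = [simulate_mutated_kmers(L, r, k, rng) for _ in range(200)]
    mean_mutated = sum(trials) / len(trials)
    print("simulated mean:", mean_mutated)
    print("expected value:", expected_mutated_kmers(L, r, k))
    print("estimated r:   ", estimate_r(mean_mutated, L, k))
```

Because neighbouring k-mers overlap, the mutated/nonmutated indicators are strongly correlated, so a single run can deviate from the expectation by far more than a binomial model would suggest. That dependence is the reason the variance needs a dedicated derivation rather than a simple binomial formula.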