The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

JOURNAL OF COMPUTATIONAL BIOLOGY (2022)

Journal

JOURNAL OF COMPUTATIONAL BIOLOGY

Publisher

MARY ANN LIEBERT, INC

DOI: 10.1089/cmb.2021.0431

Keywords

confidence intervals; k-mers; MinHash; mutation process; sketching; Jaccard similarity

Automated Summary

In this study, we investigate how a simple mutation process affects the k-mers of a sequence such as a genome or a read. We derive the expectation and variance of the number of mutated k-mers, as well as of the numbers of islands and oceans, and provide hypothesis tests and confidence intervals for the mutation rate based on the observed number of mutated k-mers or on the Jaccard similarity.

Abstract

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider a simple model in which a sequence S (e.g., a genome or a read) undergoes a mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the numbers of islands (maximal intervals of mutated k-mers) and oceans (maximal intervals of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
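The model in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: all function names are hypothetical, and it covers only quantities that follow directly from the stated assumptions. With no spurious matches, a k-mer is nonmutated exactly when none of its k nucleotides is mutated, which happens with probability (1 - r)^k, so the expected number of mutated k-mers among L is L(1 - (1 - r)^k). Inverting that relation gives a point estimate of r from an observed count. The Jaccard conversion further assumes that both sequences have L distinct k-mers, so J = (L - N) / (L + N) for N mutated k-mers. The paper's variances, hypothesis tests, and confidence intervals are not reproduced here.

```python
import random


def mutation_mask(num_nucleotides, r, rng):
    """Return a list of booleans: True where a nucleotide is mutated.

    Each position is mutated independently with probability r.
    """
    return [rng.random() < r for _ in range(num_nucleotides)]


def mutated_kmers(mask, k):
    """Flag each k-mer as mutated if any of its k positions is mutated.

    Assumes no spurious matches: a mutated k-mer never coincides with
    another k-mer of the original sequence.
    """
    num_kmers = len(mask) - k + 1
    window = sum(mask[:k])  # mutated positions in the current window
    flags = []
    for i in range(num_kmers):
        flags.append(window > 0)
        if i + k < len(mask):
            window += mask[i + k] - mask[i]  # slide the window by one
    return flags


def count_runs(flags, value):
    """Count maximal runs of `value`: islands (True) or oceans (False)."""
    runs, prev = 0, None
    for flag in flags:
        if flag == value and prev != value:
            runs += 1
        prev = flag
    return runs


def expected_mutated(num_kmers, k, r):
    """Expected number of mutated k-mers: L * (1 - (1 - r)^k)."""
    return num_kmers * (1 - (1 - r) ** k)


def estimate_r_from_count(n_mut, num_kmers, k):
    """Point estimate of r obtained by inverting expected_mutated."""
    return 1 - (1 - n_mut / num_kmers) ** (1 / k)


def estimate_r_from_jaccard(jaccard, k):
    """Point estimate of r from J = (L - N) / (L + N), i.e. 1 - N/L = 2J / (1 + J)."""
    return 1 - (2 * jaccard / (1 + jaccard)) ** (1 / k)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    k, r, num_kmers = 21, 0.01, 100_000
    mask = mutation_mask(num_kmers + k - 1, r, rng)
    flags = mutated_kmers(mask, k)
    n_mut = sum(flags)
    print("observed mutated k-mers:", n_mut)
    print("expected mutated k-mers:", expected_mutated(num_kmers, k, r))
    print("islands:", count_runs(flags, True), "oceans:", count_runs(flags, False))
    print("point estimate of r:", estimate_r_from_count(n_mut, num_kmers, k))
```

Because neighboring k-mers share k - 1 nucleotides, their mutation statuses are strongly dependent. This is why a point estimate alone says little about uncertainty and why the variance and CI results described in the abstract are needed.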
