☆ 4.8 Article

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA (2020)

Journal

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA

Volume 117, Issue 29, Pages 16961-16968

Publisher

NATL ACAD SCIENCES

DOI: 10.1073/pnas.1903436117

Keywords

probabilistic data structures; spaced seeds; sequence classification; Bloom filters; alignment-free

Funding

National Human Genome Research Institute of the National Institutes of Health [R01HG007182]
Genome Canada
Genome British Columbia [281ANV]

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Abstract

Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipe-lines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with se-quencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and mem-ory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying se-quences from multiple targets or references. Available within Bio-Bloom Tools, we illustrate the utility of miBF in two use cases: read -binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment -based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters

Journal

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA

Publisher

NATL ACAD SCIENCES

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters

Journal

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA

Publisher

NATL ACAD SCIENCES

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper