Article

ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies

Journal

BIOINFORMATICS
Volume 29, Issue 4, Pages 435-443

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/bts723

Funding

  1. Office of Science of the U.S. Department of Energy [DE-FG02-97ER25308, DE-AC02-05CH112, DE-AC02-05CH11231]
  2. Startup and Production Allocation Award from the National Energy Research Scientific Computing Center (NERSC) of the Office of Science of the U.S. Department of Energy [DE-AC02-05CH11231]
  3. Air Force Office of Scientific Research [FA9550-12-1-0200]
  4. Directorate for Computer & Information Science & Engineering, Division of Information & Intelligent Systems, National Science Foundation [1247696, 1247637]
  5. Directorate for Computer & Information Science & Engineering, Division of Information & Intelligent Systems, National Science Foundation [1142251]

Motivation: Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.

Results: In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.
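The abstract describes a score that integrates several kinds of evidence (insert length, sequencing coverage, and so on) into one reference-independent measure. A minimal sketch of that general idea follows. It is not the ALE implementation: the function names, the Gaussian insert-length model, the Poisson coverage model, and the assumption that components are independent (so their log-likelihoods add) are all illustrative choices that the abstract does not specify.

```python
import math


def insert_log_likelihood(insert_lengths, mean, std):
    """Sum of Gaussian log-densities for observed insert lengths.

    Assumption (not from the abstract): insert lengths are modelled
    as normally distributed around the library mean.
    """
    total = 0.0
    for x in insert_lengths:
        z = (x - mean) / std
        total += -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))
    return total


def coverage_log_likelihood(depths, expected_depth):
    """Sum of Poisson log-probabilities for per-base read depths.

    Assumption (not from the abstract): depth at each position is
    Poisson-distributed with a single expected depth.
    """
    total = 0.0
    for d in depths:
        total += d * math.log(expected_depth) - expected_depth - math.lgamma(d + 1)
    return total


def combined_log_likelihood(component_scores):
    """Add component log-likelihoods, treating them as independent."""
    return sum(component_scores.values())


# Hypothetical toy data for two candidate assemblies of the same reads.
assembly_a = {
    "insert": insert_log_likelihood([198, 203, 201], mean=200.0, std=10.0),
    "coverage": coverage_log_likelihood([29, 31, 30], expected_depth=30.0),
}
assembly_b = {
    "insert": insert_log_likelihood([198, 460, 201], mean=200.0, std=10.0),
    "coverage": coverage_log_likelihood([29, 2, 30], expected_depth=30.0),
}

score_a = combined_log_likelihood(assembly_a)
score_b = combined_log_likelihood(assembly_b)
```

Under these assumptions the assembly with the higher combined log-likelihood is the one the reads support better. In the toy data, the outlying insert length and the coverage drop in `assembly_b` both lower its score, which is the kind of signal that could flag a candidate misassembly.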
