Article

U50: A New Metric for Measuring Assembly Output Based on Non-Overlapping, Target-Specific Contigs

Journal

JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 24, Issue 11, Pages 1071-1080

Publisher

MARY ANN LIEBERT, INC
DOI: 10.1089/cmb.2017.0013

Keywords

genome assembly; N-50; next-generation sequencing; U-50

Funding

  1. CDC
  2. Centers for Disease Control and Prevention (CDC), through the Advanced Molecular Detection Initiative line item

Advances in next-generation sequencing technologies enable routine genome sequencing, generating millions of short reads. A crucial step for full genome analysis is de novo assembly, and currently, the performance of different assembly methods is measured by a metric called N-50. However, the N-50 value can produce skewed, inaccurate results when complex data are analyzed, especially for viral and microbial datasets. To provide a better assessment of assembly output, we developed a new metric called U-50. The U-50 identifies unique, target-specific contigs by using a reference genome as a baseline, aiming to circumvent some limitations that are inherent to the N-50 metric. Specifically, the U-50 program removes overlapping sequence shared by multiple contigs by utilizing a mask array, so the performance of the assembly is measured only by unique contigs. We compared simulated and real datasets by using U-50 and N-50, and our results demonstrated that U-50 has the following advantages over N-50: (1) reducing erroneously large N-50 values due to a poor assembly, (2) eliminating overinflated N-50 values caused by large measurements from overlapping contigs, (3) eliminating diminished N-50 values caused by an abundance of small contigs, and (4) allowing comparisons across different platforms or samples based on the new percentage-based metric UG(50)%. The use of the U-50 metric allows for a more accurate measure of assembly performance by analyzing only the unique, non-overlapping contigs. In addition, most viral and microbial sequencing datasets have high background noise (i.e., host and other non-target sequences), which contributes to a skewed, misrepresented N-50 value; this is corrected by U-50. Also, the UG(50)% can be used to compare assembly results from different samples or studies, a cross-comparison that cannot be performed with N-50.
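
The mask-array idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' U-50 program. It assumes that each contig has already been aligned to the reference and is given as a half-open (start, end) interval in reference coordinates, that off-target contigs (host or other non-target sequence) are simply absent from that list, that contigs are processed from longest to shortest, and that UG(50)% is UG50 expressed as a percentage of the reference length. The function names and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def n50(lengths: Sequence[int], total: Optional[int] = None) -> int:
    """Return the length L at which the running sum of the lengths, taken
    longest first, reaches half of `total` (default: the sum of `lengths`).
    Returns 0 if half of `total` is never reached."""
    ordered = sorted(lengths, reverse=True)
    target = (sum(ordered) if total is None else total) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0


def unique_lengths(alignments: Sequence[Tuple[int, int]],
                   genome_length: int) -> List[int]:
    """Count, for each contig, only the reference positions that no longer
    contig has already covered (assumed ordering: longest contig first)."""
    mask = [False] * genome_length  # the mask array over the reference
    result = []
    by_length = sorted(alignments, key=lambda a: a[1] - a[0], reverse=True)
    for start, end in by_length:
        fresh = 0
        for pos in range(max(start, 0), min(end, genome_length)):
            if not mask[pos]:
                mask[pos] = True
                fresh += 1
        if fresh:  # contigs that add no new positions are dropped
            result.append(fresh)
    return result


# Toy data (hypothetical): three contigs aligned to a 1000-bp reference.
genome_length = 1000
alignments = [(0, 600), (100, 500), (550, 900)]

raw_lengths = [end - start for start, end in alignments]
plain_n50 = n50(raw_lengths)                    # overlaps counted repeatedly

unique = unique_lengths(alignments, genome_length)
u50 = n50(unique)                               # half of total unique length
ug50 = n50(unique, total=genome_length)         # half of reference length
ug50_percent = 100 * ug50 / genome_length       # assumed definition of UG(50)%

print(plain_n50, unique, u50, ug50, ug50_percent)
```

In this toy case the second contig lies entirely inside the first, so it contributes no unique positions and drops out, and the third contig is credited only for the 300 positions not already covered. Because UG(50)% is normalised by the reference length, it can be compared across samples with different genome sizes, which is the cross-comparison the abstract says raw N-50 values cannot support.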
