Article

MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

METHODS (2016)

Journal

METHODS

Volume 102, Issue -, Pages 3-11

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE

DOI: 10.1016/j.ymeth.2016.02.020

Keywords

Metagenome assembly; Succinct data structure; Parallel computing

Categories

Biochemical Research Methods; Biochemistry & Molecular Biology

Funding

Grants-in-Aid for Scientific Research [16H02781] Funding Source: KAKEN

Abstract

Metagenomics has benefited greatly from low-cost, high-throughput sequencing technologies, yet the tremendous amount of data generated makes analyses such as de novo assembly consume excessive computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note, Li et al. (2015) [1]), the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of giga base pairs (Gbp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn graphs (SdBG), implemented on a graphics processing unit (GPU). The software has been well received by the assembly community, and there is interest in how to adapt the algorithms to integrate popular assembly practices so as to improve assembly quality, as well as in how to speed up the software using better CPU-based algorithms (instead of the GPU). In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then present the new modules that upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0 improves significantly over v0.1: a 36% increase in assembly size and a 23% increase in N50. More interestingly, MEGAHIT v1.0 is no slower than before, even when running with the extra modules. This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using the CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 h, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. With its smaller memory footprint, MEGAHIT v1.0 can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at the website https://hku-bal.github.io/megabox. (C) 2016 Elsevier Inc. All rights reserved.
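
The abstract reports assembly quality partly in terms of N50. As a reading aid, the following minimal sketch shows how N50 is conventionally computed from a list of contig lengths: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly size. The function name n50 and the example lengths are illustrative assumptions; this is not code from MEGAHIT, and the numbers are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    Hypothetical helper for illustration only; not part of MEGAHIT.
    """
    # Visit contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that brings the running total to at least
        # half of the assembly size defines N50.
        if running >= half_total:
            return length
    return 0


# Made-up contig lengths (in bp), chosen only to make the arithmetic easy:
# total = 54, half = 27, and 10 + 9 + 8 = 27, so N50 is 8.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

Because N50 depends on the total assembly size, a larger assembly does not automatically yield a larger N50, which is why the abstract reports the two figures separately.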
