期刊
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
卷 17, 期 1, 页码 136-148出版社
IEEE COMPUTER SOC
DOI: 10.1109/TCBB.2018.2858797
关键词
De Bruijn graph; genome assembly; graph compaction; parallel algorithms
类别
资金
- National Science Foundation [IIS-1416259, CNS-1229081, CCF-1360593, CCF-1361053]
- Intel Parallel Computing Center on Big Data in Biosciences and Public Health
- Division of Computing and Communication Foundations
- Direct For Computer & Info Scie & Enginr [1361053] Funding Source: National Science Foundation
De Bruijn graph based genome assembly has gained popularity as short read sequencers become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm include bounding the chain compaction run-time to logarithmic number of iterations in the length of the longest chain, and ability to differentiate cycles from chains within logarithmic number of iterations in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is 3.7 x and 2.0x faster than equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://github.com/ParBLiSS/bruno.
作者
我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。
推荐
暂无数据