Article

VC@Scale: Scalable and high-performance variant calling on cluster environments

GIGASCIENCE (2021)

Journal

GIGASCIENCE

Volume 10, Issue 9

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/gigascience/giab057

Keywords

whole-genome sequencing; Apache Spark; Apache Arrow; BWA-MEM; sorting; MarkDuplicate; DeepVariant

Categories

Biology; Multidisciplinary Sciences

Funding

Punjab Educational Endowment Fund (PEEF), Pakistan

Summary

The article proposes a high-performance workflow for deep learning-based variant calling and introduces efficient data transfer between workflow stages using Python and Apache Arrow. The design outperforms existing implementations by more than 2 times in the pre-processing stages, yielding a scalable, high-performance solution for DeepVariant.

Abstract

Background: Many deep learning-based variant-calling methods, such as DeepVariant, have recently emerged. They are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational cost. There is therefore a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scale variant-calling workflows that use Apache Spark/Hadoop as big data frameworks only loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark merely to distribute and schedule data among loosely coupled applications, or storing the output of intermediate applications on I/O-based storage, does not exploit the full benefit of Apache Spark's in-memory processing. To exploit that benefit, we propose a native Spark-based workflow that uses Python and Apache Arrow to transfer data efficiently between workflow stages. The workflow benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations.

Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by >2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters.

Conclusions: We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations.
All code, scripts, and configurations used to run our implementations are publicly available as open source; see https://github.com/abs-tudelft/variant-calling-at-scale.
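
To make the two pre-processing steps named in the abstract concrete, the following is a minimal single-process sketch of coordinate sorting and duplicate marking. It is not the authors' Spark implementation: the `Read` record, its field names, and the duplicate rule (reads sharing chromosome, start position, and strand are duplicates, and the one with the highest base-quality sum is kept) are simplifying assumptions made for illustration only. In a Spark workflow the same grouping and ordering would be expressed with distributed sort and group operations over partitions rather than Python lists.

```python
from dataclasses import dataclass, replace
from itertools import groupby
from typing import List


@dataclass(frozen=True)
class Read:
    """Hypothetical, heavily simplified aligned-read record."""
    name: str
    chrom: str
    pos: int            # leftmost aligned position
    reverse: bool       # True if aligned to the reverse strand
    qual_sum: int       # sum of base qualities
    duplicate: bool = False


def sort_by_coordinate(reads: List[Read]) -> List[Read]:
    # Order by chromosome name, then position. Real tools order
    # chromosomes as listed in the alignment header; plain string
    # ordering is an assumption made to keep the sketch short.
    return sorted(reads, key=lambda r: (r.chrom, r.pos))


def mark_duplicates(reads: List[Read]) -> List[Read]:
    # Group reads that share chromosome, position, and strand.
    def key(r: Read):
        return (r.chrom, r.pos, r.reverse)

    marked = []
    for _, group in groupby(sorted(reads, key=key), key=key):
        members = list(group)
        # Keep the read with the highest quality sum; flag the rest.
        best = max(members, key=lambda r: r.qual_sum)
        for r in members:
            marked.append(replace(r, duplicate=(r is not best)))
    return sort_by_coordinate(marked)


reads = [
    Read("r1", "chr2", 100, False, 300),
    Read("r2", "chr1", 500, False, 280),
    Read("r3", "chr1", 500, False, 310),
    Read("r4", "chr1", 500, True, 250),
    Read("r5", "chr1", 42, False, 290),
]

marked = mark_duplicates(reads)
for r in marked:
    print(r.name, r.chrom, r.pos, "DUP" if r.duplicate else "")
```

In this toy input, `r2` and `r3` share chromosome, position, and strand, so only the lower-quality `r2` is flagged, while `r4` at the same position on the opposite strand is left unflagged.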
