Journal
BIOINFORMATICS
Volume 30, Issue 1, Pages 1-8
Publisher
OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btt250
Funding
- NIH [NIH 5R01-HG004962]
- iDASH project [U54 HL108460]
- CSRO scholarship
- NATIONAL CANCER INSTITUTE [P30CA023100] Funding Source: NIH RePORTER
- NATIONAL HEART, LUNG, AND BLOOD INSTITUTE [U54HL108460] Funding Source: NIH RePORTER
- NATIONAL HUMAN GENOME RESEARCH INSTITUTE [R01HG004962] Funding Source: NIH RePORTER
Motivation: With high-throughput DNA sequencing costs dropping below $1000 for human genomes, data storage, retrieval and analysis are the major bottlenecks in biological studies. To address the large-data challenges, we advocate a clean separation between evidence collection and inference in variant calling. We define and implement a Genome Query Language (GQL) that allows for the rapid collection of the evidence needed for calling variants.
Results: We provide a number of cases to showcase the use of GQL for complex evidence collection, such as the evidence for large structural variations. Specifically, typical GQL queries can be written in 5-10 lines of high-level code and search large datasets (100 GB) in minutes. We also demonstrate its complementarity with other variant calling tools: popular variant calling tools can achieve a speed-up of one order of magnitude by using GQL to retrieve evidence. Finally, we show how GQL can be used to query and compare multiple datasets. Separating evidence collection from inference frees variant detection tools from data-intensive evidence collection and lets them focus on statistical inference.
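The separation between evidence collection and inference can be illustrated with a minimal Python sketch. This is not GQL syntax and not the authors' implementation; the `ReadPair` record, the `collect_evidence` and `call_deletion` functions, and the thresholds are all hypothetical names and values chosen for illustration. The sketch assumes that paired-end reads spanning an unusually long distance on the reference are one kind of evidence for a deletion.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ReadPair:
    """A paired-end read mapping (hypothetical minimal record, not GQL's schema)."""
    chrom: str
    left_start: int   # leftmost mapped position of the first mate
    right_end: int    # rightmost mapped position of the second mate

    @property
    def span(self) -> int:
        # Distance covered on the reference by the whole pair.
        return self.right_end - self.left_start


def collect_evidence(pairs: Iterable[ReadPair], chrom: str,
                     start: int, end: int, max_span: int) -> List[ReadPair]:
    """Evidence collection: select pairs inside a region whose span exceeds max_span."""
    return [p for p in pairs
            if p.chrom == chrom
            and p.left_start >= start
            and p.right_end <= end
            and p.span > max_span]


def call_deletion(evidence: List[ReadPair], min_support: int = 2) -> bool:
    """Inference: a deliberately naive rule that sees only the collected evidence."""
    return len(evidence) >= min_support


pairs = [
    ReadPair("chr1", 1000, 1400),   # normal span (400)
    ReadPair("chr1", 1200, 3300),   # long span (2100)
    ReadPair("chr1", 1250, 3400),   # long span (2150)
    ReadPair("chr2", 1200, 3300),   # long span, but on a different chromosome
]
evidence = collect_evidence(pairs, "chr1", 1000, 5000, max_span=1000)
deletion_called = call_deletion(evidence)
```

Because `call_deletion` never touches the full read set, the inference rule could be swapped for a statistical model without changing how the evidence is retrieved, which is the division of labour the abstract argues for.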