Article

Fast two-stage phasing of large-scale sequence data

Journal

AMERICAN JOURNAL OF HUMAN GENETICS
Volume 108, Issue 10, Pages 1880-1890

Publisher

CELL PRESS
DOI: 10.1016/j.ajhg.2021.08.005

Funding

  1. National Human Genome Research Institute of the National Institutes of Health [HG008359]
  2. National Heart, Lung, and Blood Institute (NHLBI)
  3. TOPMed Informatics Research Center [3R01HL117626-02S1, HHSN268201800002I]
  4. National Institutes of Health (NIH) [R01HL104608, R01HL087699, HL104608 S1]
  5. NHLBI [NO1-HC-25195, HHSN268201500001I, 75N92019D00031, R01 HL092577-06S1]
  6. [R01HL-120393]
  7. [U01HL-120393]
  8. [HHSN268201800001I]
  9. [HHSN268201600034I]
  10. [U54HG003067]

The haplotype phasing method presented in this study estimates haplotypes from genotype data efficiently by using marker windowing and composite reference haplotypes. In a first stage it phases high-frequency markers with a progressive phasing algorithm; in a second stage it phases low-frequency markers via genotype imputation. For TOPMed sequence data, Beagle 5.2 is more than 20 times faster than SHAPEIT 4.2.1, achieves similar accuracy, and scales to larger sample sizes.
Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.
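
The two-stage idea described above depends on first separating high-frequency markers from low-frequency ones. The following is a minimal Python sketch of that partitioning step only, not the Beagle 5.2 implementation. The function names, the genotype encoding (allele pairs coded 0 or 1 for diallelic markers), and the `maf_threshold` parameter are illustrative assumptions; the abstract does not state the frequency cutoff used.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding: one (allele1, allele2) pair per sample,
# with 0 = reference allele and 1 = alternate allele.
Genotype = Tuple[int, int]


def minor_allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """Return the minor allele frequency of one diallelic marker."""
    n_alleles = 2 * len(genotypes)
    if n_alleles == 0:
        return 0.0
    alt_freq = sum(a + b for a, b in genotypes) / n_alleles
    return min(alt_freq, 1.0 - alt_freq)


def split_markers_by_frequency(
    markers: Sequence[Sequence[Genotype]],
    maf_threshold: float,  # assumed cutoff, chosen by the caller
) -> Tuple[List[int], List[int]]:
    """Partition marker indices into (high-frequency, low-frequency).

    In a two-stage scheme, the first list would be phased directly
    (stage 1) and the second list would be phased afterwards by
    imputation onto the stage 1 haplotypes (stage 2).
    """
    high, low = [], []
    for index, genotypes in enumerate(markers):
        if minor_allele_frequency(genotypes) >= maf_threshold:
            high.append(index)
        else:
            low.append(index)
    return high, low


if __name__ == "__main__":
    example = [
        [(0, 1), (1, 1), (0, 0), (0, 1)],  # common variant
        [(0, 0), (0, 0), (0, 0), (0, 1)],  # rarer variant
    ]
    print(split_markers_by_frequency(example, maf_threshold=0.2))
```

Under this sketch, only the markers in the first list would go through the more expensive iterative phasing, which is consistent with the speed advantage the abstract reports for sequence data with many low-frequency variants.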
