Article

Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes

GENETICS (2020)

Journal

GENETICS

Volume 215, Issue 3, Pages 779-797

Publisher

GENETICS SOCIETY AMERICA

DOI: 10.1534/genetics.120.303253

Keywords

genealogy; tree sequence; genotype statistics

Category

Genetics & Heredity

Funding

National Science Foundation [1262645]
National Institutes of Health [R01-GM115564]
Robertson Foundation
Wellcome Trust [100956/Z/13/Z]
Direct For Biological Sciences
Div Of Biological Infrastructure [1262645] Funding Source: National Science Foundation

Abstract

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation; and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics' relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
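
The weight-and-summary-function idea in the abstract can be illustrated on a single toy tree. The following is a minimal sketch, not the paper's implementation: it handles one tree rather than a tree sequence, recomputes subtree weights from scratch instead of updating them incrementally along the genome, and does not normalize by sequence length. All names (`weights_below`, `site_stat`, `branch_stat`, `diversity_summary`, and the toy tree itself) are hypothetical and chosen only for illustration. With a weight of 1 per sample and the summary function f(x) = 2x(n - x) / (n(n - 1)), the site version counts pairwise differences per mutation and the branch version gives the mean pairwise branch distance between samples.

```python
from collections import defaultdict


def weights_below(parent, sample_weights):
    """Sum the sample weights found in the subtree below each node.

    parent maps child -> parent; the root has no entry.
    """
    total = defaultdict(float)
    for sample, weight in sample_weights.items():
        node = sample
        while node is not None:  # walk from the sample up to the root
            total[node] += weight
            node = parent.get(node)
    return dict(total)


def site_stat(parent, sample_weights, mutation_nodes, f):
    """Site mode: apply f to the weight below each mutation's node and sum."""
    below = weights_below(parent, sample_weights)
    return sum(f(below.get(node, 0.0)) for node in mutation_nodes)


def branch_stat(parent, time, sample_weights, f):
    """Branch mode: sum branch length times f(weight below the branch)."""
    below = weights_below(parent, sample_weights)
    return sum(
        (time[p] - time[c]) * f(below.get(c, 0.0)) for c, p in parent.items()
    )


def diversity_summary(n):
    """Summary function for pairwise diversity among n equally weighted samples."""
    return lambda x: 2 * x * (n - x) / (n * (n - 1))


# Toy tree (an assumption for illustration): samples 0..3, internal nodes 4..6.
PARENT = {0: 4, 1: 4, 4: 5, 2: 5, 5: 6, 3: 6}
TIME = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0, 6: 3.0}
WEIGHTS = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
F = diversity_summary(len(WEIGHTS))

# One mutation on the branch above node 4 and one on the branch above node 3.
site_value = site_stat(PARENT, WEIGHTS, [4, 3], F)
branch_value = branch_stat(PARENT, TIME, WEIGHTS, F)
print(site_value, branch_value)
```

Swapping in a different weight assignment (for example, indicator weights for membership in several sample sets) or a different summary function yields a different statistic from the same traversal, which is the design choice the abstract describes.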
