Article

pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data

MOLECULAR ECOLOGY RESOURCES (2021)

Journal

MOLECULAR ECOLOGY RESOURCES

Volume 21, Issue 4, Pages 1359-1368

Publisher

WILEY

DOI: 10.1111/1755-0998.13326

Keywords

bioinformatics; phyloinformatics; genomics; proteomics; molecular evolution; population genetics – empirical; software

Categories

Biochemistry & Molecular Biology; Ecology; Evolutionary Biology

Funding

Division of Environmental Biology [DEB-1754439]
National Institute of General Medical Sciences [1R35GM133481-01]

Summary

Population genetic analyses often use summary statistics such as pi and d(XY) to describe patterns of genetic variation. Missing data can bias these statistics, leading to underestimates of genetic diversity within and between populations. This study introduces pixy, a user-friendly UNIX command line utility that addresses the problem and provides unbiased estimates of pi and d(XY) regardless of the amount or form of missing data.

Abstract

Population genetic analyses often use summary statistics to describe patterns of genetic variation and provide insight into evolutionary processes. Among the most fundamental of these summary statistics are pi and d(XY), which are used to describe genetic diversity within and between populations, respectively. Here, we address a widespread issue in pi and d(XY) calculation: systematic bias generated by missing data of various types. Many popular methods for calculating pi and d(XY) operate on data encoded in the variant call format (VCF), which condenses genetic data by omitting invariant sites. When calculating pi and d(XY) using a VCF, it is often implicitly assumed that missing genotypes (including those at sites not represented in the VCF) are homozygous for the reference allele. Here, we show how this assumption can result in substantial downward bias in estimates of pi and d(XY) that is directly proportional to the amount of missing data. We discuss the pervasive nature and importance of this problem in population genetics, and introduce a user-friendly UNIX command line utility, pixy, that solves this problem via an algorithm that generates unbiased estimates of pi and d(XY) in the face of missing data. We compare pixy to existing methods using both simulated and empirical data, and show that pixy alone produces unbiased estimates of pi and d(XY) regardless of the form or amount of missing data. In summary, our software solves a long-standing problem in applied population genetics and highlights the importance of properly accounting for missing data in population genetic analyses.
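The downward bias described above can be illustrated with a small sketch. This is not pixy's implementation; the function names, the haploid 0/1 allele coding, and the use of `None` for a missing call are all assumptions made for illustration. One estimator counts differences and comparisons only among called alleles, while the other fills every missing call with the reference allele, as the implicit assumption in the abstract would.

```python
from itertools import combinations


def pairwise_counts(site):
    """Return (differences, comparisons) among the called alleles at one site.

    `site` is a list of alleles, one per haploid sample; None marks a missing call.
    """
    called = [allele for allele in site if allele is not None]
    diffs = sum(1 for a, b in combinations(called, 2) if a != b)
    comps = len(called) * (len(called) - 1) // 2
    return diffs, comps


def pi_missing_aware(sites):
    """Total differences over total comparisons, skipping missing calls."""
    total_diffs = 0
    total_comps = 0
    for site in sites:
        diffs, comps = pairwise_counts(site)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps if total_comps else float("nan")


def pi_missing_as_reference(sites, ref=0):
    """Same ratio, but every missing call is first replaced by the reference allele."""
    filled = [[ref if allele is None else allele for allele in site] for site in sites]
    return pi_missing_aware(filled)


# Four haploid samples, four sites; the last two sites have no calls at all,
# as if they had been omitted from the VCF.
sites = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [None, None, None, None],
    [None, None, None, None],
]

aware = pi_missing_aware(sites)         # 4 differences / 12 comparisons
naive = pi_missing_as_reference(sites)  # 4 differences / 24 comparisons
print(f"missing-aware: {aware:.4f}, missing-as-reference: {naive:.4f}")
```

In this toy input half of the sites are missing, and the estimate that treats missing calls as reference is half the missing-aware estimate, which matches the proportional bias the abstract describes.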
