Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data

GIGASCIENCE (2022)

Journal

GIGASCIENCE

Volume 11

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/gigascience/giac032

Keywords

site frequency spectrum; high-throughput sequencing; genotype likelihoods; next-generation sequencing; maximum likelihood; population genetics; threading

Funding

Carlsberg [CF19-0712]
Leverhulme Research Project [RPG-2018-208]
Lundbeck Foundation Centre for Disease Evolution Grant [R302-2018-2155]
Erasmus+ programme
Imperial College FoNS European Partners award

Automated Summary

This study presents a new method for accurately estimating the multidimensional joint site frequency spectrum based on low-coverage sequencing data. The method is computationally efficient and storage-saving compared to previous implementations, and its accuracy has been validated using data from a fungal pathogen.

Abstract

Background: The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping.

Results: Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data.

Conclusion: The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms.
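To make the likelihood maximization described in the abstract more concrete, the following is a minimal sketch of a plain expectation-maximization (EM) update for a one-dimensional site frequency spectrum, followed by a folding step. It is not the authors' implementation: it omits the binning heuristic, the EM acceleration, multiple populations, and bootstrapping. The function names `em_sfs` and `fold` are hypothetical, and the input is assumed to be a precomputed table in which `site_likelihoods[s][k]` is the probability of the data at site `s` given `k` derived alleles in the sample.

```python
from typing import List, Sequence


def em_sfs(site_likelihoods: Sequence[Sequence[float]],
           max_iter: int = 500,
           tol: float = 1e-10) -> List[float]:
    """Estimate a 1D SFS by plain, unaccelerated EM (illustrative only).

    Assumption: site_likelihoods[s][k] = P(data at site s | k derived alleles),
    already derived from genotype likelihoods by some upstream step.
    """
    n_categories = len(site_likelihoods[0])
    sfs = [1.0 / n_categories] * n_categories  # flat starting point

    for _ in range(max_iter):
        expected = [0.0] * n_categories
        for row in site_likelihoods:
            # E-step: posterior over derived-allele counts at this site
            weights = [lik * p for lik, p in zip(row, sfs)]
            total = sum(weights)
            if total == 0.0:
                continue  # site has zero likelihood under the current SFS
            for k, w in enumerate(weights):
                expected[k] += w / total

        norm = sum(expected)
        if norm == 0.0:
            raise ValueError("no informative sites")

        # M-step: the new SFS is the normalized sum of site posteriors
        new_sfs = [e / norm for e in expected]
        change = max(abs(a - b) for a, b in zip(new_sfs, sfs))
        sfs = new_sfs
        if change < tol:
            break
    return sfs


def fold(sfs: Sequence[float]) -> List[float]:
    """Fold an unfolded SFS by merging derived counts k and n - k."""
    n = len(sfs) - 1
    folded = []
    for k in range(n // 2 + 1):
        if k == n - k:
            folded.append(sfs[k])  # middle class has no partner
        else:
            folded.append(sfs[k] + sfs[n - k])
    return folded


# Toy input: four sites whose allele counts are known with certainty
toy = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
estimate = em_sfs(toy)
print(estimate)        # [0.5, 0.25, 0.25]
print(fold(estimate))  # [0.75, 0.25]
```

With certain data the estimate simply equals the observed proportions. With low-coverage data each row spreads its mass over several allele counts, and the EM iterations weigh those rows by the current spectrum instead of committing to a single called genotype.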
