Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions

BIOINFORMATICS (2017)

Journal

BIOINFORMATICS

Volume 33, Issue 15, Pages 2322-2329

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/bioinformatics/btx133

Funding

National Institutes of Health [R01-GM101352, R01-HG007178]
National Science Foundation, Directorate for Biological Sciences, Division of Biological Infrastructure [DBI-1356548]

Abstract

Motivation: Accurate identification of genotypes is an essential part of the analysis of genomic data, including the identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others.

Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was composed of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low-complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls.
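To make the two-component idea concrete, the following is a minimal sketch of how a site's read counts could be scored under a mixture of Dirichlet-multinomial distributions. It is not the authors' implementation: the function names, the mixture weights, and the concentration parameters are all hypothetical, and the example uses only two read categories (reference, alternate) at a homozygous-reference site for brevity. Smaller concentration parameters give a more overdispersed component.

```python
from math import lgamma, exp, log

def dm_logpmf(counts, alpha):
    """Log-probability of `counts` under a Dirichlet-multinomial with
    concentration parameters `alpha` (one per read category)."""
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(x_i!)
    out = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Ratio of gamma functions from integrating out the Dirichlet
    out += lgamma(a) - lgamma(n + a)
    out += sum(lgamma(x + ai) - lgamma(ai) for x, ai in zip(counts, alpha))
    return out

def responsibilities(counts, weights, alphas):
    """Posterior probability that a site belongs to each mixture component."""
    logs = [log(w) + dm_logpmf(counts, al) for w, al in zip(weights, alphas)]
    top = max(logs)  # log-sum-exp for numerical stability
    unnorm = [exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical parameters, chosen only for illustration:
# component 0 ("major"): tight, near-binomial, about 1% error
# component 1 ("minor"): same mean but heavily overdispersed
WEIGHTS = [0.9, 0.1]
ALPHAS = [[99.0, 1.0], [0.9, 0.1]]

clean_site = [20, 0]   # all reads match the reference
messy_site = [12, 8]   # many non-reference reads at a supposed hom-ref site

clean_resp = responsibilities(clean_site, WEIGHTS, ALPHAS)
messy_resp = responsibilities(messy_site, WEIGHTS, ALPHAS)
```

Under these assumed parameters, the clean site is assigned mostly to the major component and the messy site almost entirely to the minor one. A filter in the spirit of the abstract would then drop sites whose major-component responsibility falls below some chosen threshold; in practice the weights and concentration parameters would be estimated from data rather than fixed by hand.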
