Article

Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines

Journal

GIGASCIENCE
Volume 9, Issue 2, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/gigascience/giaa007

Keywords

SNP calling; variant calling; evaluation; benchmarking; bacteria

Funding

  1. National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Healthcare Associated Infections and Antimicrobial Resistance at Oxford University
  2. Public Health England (PHE) [HPRU-2012-10 041]
  3. NIHR Biomedical Research Centre
  4. Health Data Research UK
  5. NIHR Oxford Biomedical Research Centre
  6. National Institute for Health Research
  7. University of Oxford/Public Health England Clinical Lectureship
  8. Antimicrobial Resistance Cross Council [NE/N019989/1]
  9. BBSRC Institute Strategic Program [BB/P013740/1]
  10. BBSRC [BBS/E/D/20002173] Funding Source: UKRI

Background: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella.

Results: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis.

Conclusions: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
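The abstract reports pipeline performance in terms of sensitivity and precision. As a rough illustration only, the sketch below shows how these quantities are commonly computed when a set of called SNPs is compared with a known truth set, as is possible with simulated data. The function name `evaluate_calls`, the `(contig, position, alternate base)` representation of a SNP, and the inclusion of an F-score are assumptions made for this example; they are not taken from the paper and may differ from the authors' evaluation code.

```python
from typing import NamedTuple, Set, Tuple

# Assumption for this sketch: a SNP is identified by
# (contig name, 1-based position, alternate base).
Snp = Tuple[str, int, str]


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    sensitivity: float
    f_score: float


def evaluate_calls(truth: Set[Snp], called: Set[Snp]) -> CallMetrics:
    """Compare called SNPs against a truth set (hypothetical helper)."""
    tp = len(truth & called)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called

    # Guard against empty sets to avoid division by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0

    # Harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0

    return CallMetrics(tp, fp, fn, precision, sensitivity, f_score)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    truth = {("chr", 10, "A"), ("chr", 20, "T"), ("chr", 30, "G"), ("chr", 40, "C")}
    called = {("chr", 10, "A"), ("chr", 20, "T"), ("chr", 50, "G")}
    print(evaluate_calls(truth, called))
```

In this toy case two of the three calls are correct and two of the four true SNPs are recovered, so precision is 2/3 and sensitivity is 1/2. Aligning reads to a more divergent reference would, per the abstract, tend to lower both values.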
