☆ 4.7 Article

HashSeq: a Simple, Scalable, and Conservative De Novo Variant Caller for 16S rRNA Gene Data Sets

MSYSTEMS (2021)

Journal

MSYSTEMS

Volume 6, Issue 6, Pages -

Publisher

AMER SOC MICROBIOLOGY

DOI: 10.1128/mSystems.00697-21

Keywords

16S rRNA gene sequence variant; microbiome; sequence variant; sequencing error

Category

Microbiology

Summary

This study introduces a fast and scalable algorithm for inferring sequence variants in 16S rRNA gene data sets, utilizing a HashMap-based approach and normal distribution combined with LOESS regression. The method is computationally efficient and produces conservative sets of variants well supported by reference databases.

Abstract

16S rRNA gene sequencing is a common and cost-effective technique for characterizing microbial communities. Recent bioinformatics methods enable high-resolution detection of sequence variants that differ by only one nucleotide. In this study, we use a very fast HashMap-based approach to detect sequence variants in six publicly available 16S rRNA gene data sets. We then use the normal distribution combined with locally estimated scatterplot smoothing (LOESS) regression to estimate background error rates as a function of sequencing depth for individual clusters of sequences. This method is computationally efficient and yields sets of variants that are conservative and well supported by reference databases. We argue that this approach to inference is fast, simple, and scalable to large data sets and provides a high-resolution set of sequence variants that are less likely to be the result of sequencing error.

IMPORTANCE Recent bioinformatics development has enabled the detection of sequence variants at a resolution of a single-nucleotide difference in 16S rRNA gene sequence data. Despite this progress, variant calling pipelines can have several limitations, such as slow runtimes or producing a large number of low-abundance sequence variants that must be filtered out with arbitrary thresholds in downstream analyses. In this report, we introduce a fast and scalable algorithm that infers sequence variants by estimating a normally distributed background error as a function of sequencing depth. Our pipeline has attractive performance characteristics, can be used independently or in parallel with other variant callers, and provides an explicit P value for each variant, evaluating the hypothesis that the variant is caused by sequencing error.
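The abstract's general idea can be illustrated with a minimal Python sketch. This is not the HashSeq implementation: the function names, the toy reads, and the background mean and standard deviation are all hypothetical. The sketch tallies identical reads in a hash map, looks up observed sequences that differ from the most abundant one at exactly one position, and computes an upper-tail normal P value for each. The paper estimates the background from the data with LOESS regression as a function of sequencing depth; here it is replaced by fixed placeholder numbers.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_sequences(reads):
    """Tally identical reads with a hash map (sequence -> count)."""
    return Counter(reads)

def one_mismatch_neighbors(seq, counts):
    """Yield observed sequences that differ from seq at exactly one position."""
    for i, base in enumerate(seq):
        for alt in BASES:
            if alt != base:
                candidate = seq[:i] + alt + seq[i + 1:]
                if candidate in counts:  # O(1) average hash lookup
                    yield candidate

def upper_tail_p(observed, expected_mean, expected_sd):
    """P(X >= observed) for X ~ Normal(expected_mean, expected_sd)."""
    z = (observed - expected_mean) / expected_sd
    return 0.5 * math.erfc(z / math.sqrt(2))

# Toy data (hypothetical): one abundant sequence and two one-mismatch children.
reads = ["ACGT"] * 1000 + ["ACGA"] * 40 + ["ACCT"] * 2
counts = count_sequences(reads)
parent = counts.most_common(1)[0][0]

for child in sorted(one_mismatch_neighbors(parent, counts)):
    # Assumed background at this depth: mean 2 reads, SD 1.5 (placeholder values,
    # standing in for the depth-dependent estimate described in the abstract).
    p = upper_tail_p(counts[child], expected_mean=2.0, expected_sd=1.5)
    print(child, counts[child], p)
```

In this toy case the child seen 40 times gets a very small P value and would be kept as a variant, while the child seen twice is indistinguishable from the assumed background.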
