4.7 Article

SeqWho: reliable, rapid determination of sequence file identity using k-mer frequencies in Random Forest classifiers

Journal

BIOINFORMATICS
Volume 38, Issue 7, Pages 1830-1837

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btac050

Funding

  1. National Institute of General Medical Sciences (NIH) [R01-GM135341]
  2. Cancer Prevention and Research Institute of Texas (CPRIT) [RR170068]
  3. NIH [5U24DK110814-05]
  4. Cancer Prevention and Research Institute of Texas (CPRIT) [RP150596]

SeqWho is a program that accurately and rapidly classifies sequencing files from their k-mer frequencies and repeat sequence content, providing reliable identification of the organism and protocol type.
Motivation: With vast improvements in sequencing technologies and a growing number of protocols, sequencing is being used to answer complex biological problems. Consequently, analysis pipelines have become more time-consuming and complicated, usually requiring extensive prevalidation steps. Here, we present SeqWho, a program designed to heuristically assess the quality of sequencing files and reliably classify the organism and protocol type using Random Forest classifiers trained on biases inherent in k-mer frequencies and repeat sequence identities.

Results: Using one of our primary models, we show that our method accurately and rapidly classifies human and mouse sequences from nine different sequencing libraries by species, by library and by both together 98.32%, 97.86% and 96.38% of the time, respectively. Ultimately, we demonstrate that SeqWho is a powerful method for reliably validating the quality and identity of the sequencing files used in any pipeline.
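To make the idea of k-mer frequency features concrete, the following is a minimal sketch of how a sequencing file could be turned into a fixed-length frequency vector. It is not SeqWho's implementation: the function names `parse_fastq_sequences` and `kmer_frequencies`, the choice of k = 2, and the handling of ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous bases contribute to counts


def parse_fastq_sequences(text):
    """Return the sequence lines of a FASTQ string (second line of each 4-line record)."""
    lines = text.strip().splitlines()
    return [lines[i].strip() for i in range(1, len(lines), 4)]


def kmer_frequencies(reads, k=2):
    """Return normalized k-mer frequencies in a fixed order (AA, AC, ..., TT for k=2).

    k-mers containing any character outside ACGT (for example N) are skipped.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if all(base in ALPHABET for base in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    ordered_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # An empty input yields an all-zero vector rather than a division error.
    return [counts[km] / total if total else 0.0 for km in ordered_kmers]


# Hypothetical two-read FASTQ snippet used only to exercise the functions.
example_fastq = "@r1\nACGT\n+\nIIII\n@r2\nAANA\n+\nIIII\n"
features = kmer_frequencies(parse_fastq_sequences(example_fastq), k=2)
```

A vector like `features`, computed per file and paired with known species and library labels, is the kind of input a Random Forest classifier (for instance scikit-learn's `RandomForestClassifier`) could be trained on. The repeat sequence features mentioned in the abstract are not covered by this sketch.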
