Article; Proceedings Paper

A multi-task CNN learning model for taxonomic assignment of human viruses

Journal

BMC BIOINFORMATICS
Volume 22, Issue SUPPL 6, Pages -

Publisher

BMC
DOI: 10.1186/s12859-021-04084-w

Keywords

Convolutional neural network; Deep learning; Taxonomic assignment; Genomic coverage; Naive Bayesian network

Funding

  1. Yong Loo Lin School of Medicine, National University of Singapore
  2. Department of Biochemistry, National University of Singapore

Abstract

This study developed a pipeline that combines a multi-task learning model with a Bayesian ranking approach to identify and rank human viruses, including those with divergent sequences. By incorporating genomic region assignment into the Bayesian ranking, the pipeline improves the accuracy and sensitivity of taxonomic assignment from sequence data.

Background: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads, based on alignment or alignment-free k-mer approaches, may not perform optimally when the sequences diverge significantly from the reference sequences. Furthermore, many tools do not incorporate the genomic coverage of assigned reads into the overall likelihood of a correct taxonomic assignment for a sample.

Results: In this paper, we describe the development of a pipeline that incorporates a multi-task learning model based on a convolutional neural network (MT-CNN) and a Bayesian ranking approach to identify and rank the most likely human virus from sequence reads. For taxonomic assignment of reads, the MT-CNN model outperformed Kraken 2, Centrifuge, and Bowtie 2 on reads generated from simulated divergent HIV-1 genomes, and was more sensitive in identifying SARS as the closest relation in four RNA sequencing datasets for the SARS-CoV-2 virus. For genomic region assignment of assigned reads, the MT-CNN model performed competitively with Bowtie 2. The region assignments were used to estimate genomic coverage, which was incorporated into a naive Bayesian network together with the proportion of taxonomic assignments to rank the likelihood of candidate human viruses from sequence data.

Conclusions: We have developed a pipeline that combines a novel MT-CNN model, which identifies viruses with divergent sequences and assigns each read to a genomic region, with a Bayesian approach that ranks taxonomic assignments by taking into account both the number of assigned reads and the genomic coverage. The pipeline is available on GitHub at https://github.com/MaHaoran627/CNN_Virus.
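To make the ranking idea concrete, the following is a minimal sketch of how the proportion of assigned reads and the genomic coverage could be combined into a single score per candidate virus. It is not the authors' implementation: the function name `rank_viruses`, the choice of 10 genomic regions, and the use of each feature value directly as a stand-in for its likelihood are all assumptions made for illustration. The only element taken from the abstract is that the two features are combined as independent pieces of evidence, in the spirit of a naive Bayesian network.

```python
from collections import Counter
from math import log


def rank_viruses(read_assignments, n_regions=10, eps=1e-6):
    """Rank candidate viruses from per-read (virus, region) predictions.

    read_assignments: list of (virus_label, region_index) tuples, one per read.
    n_regions: number of genomic regions per virus (hypothetical value).
    eps: small constant to avoid log(0).

    Returns a list of (virus_label, score) sorted from most to least likely.
    """
    if not read_assignments:
        return []

    total_reads = len(read_assignments)
    read_counts = Counter(virus for virus, _ in read_assignments)

    # Collect the distinct regions hit by reads assigned to each virus.
    regions_hit = {}
    for virus, region in read_assignments:
        regions_hit.setdefault(virus, set()).add(region)

    scores = {}
    for virus, count in read_counts.items():
        proportion = count / total_reads               # share of assigned reads
        coverage = len(regions_hit[virus]) / n_regions  # share of regions covered
        # Assumption: treat the two features as independent evidence and
        # sum their logs (equivalent to multiplying them).
        scores[virus] = log(proportion + eps) + log(coverage + eps)

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample: virus "A" has more reads, but all fall in one region;
# virus "B" has fewer reads spread over four regions.
reads = [("A", 0)] * 6 + [("B", 0), ("B", 1), ("B", 2), ("B", 3)]
ranking = rank_viruses(reads)
```

In this toy sample, virus "B" ranks above virus "A" even though "A" has more reads, because reads piled onto a single region give weaker evidence than reads spread across the genome. This is the kind of case where a count-only ranking and a coverage-aware ranking disagree.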
