4.6 Article

HostNet: improved sequence representation in deep neural networks for virus-host prediction

Journal

BMC BIOINFORMATICS
Volume 24, Issue 1

Publisher

BMC
DOI: 10.1186/s12859-023-05582-9

Keywords

Virus-Host Prediction; Sequence Representation; Vectorization; Deep Learning-based Sequence Modeling

The article presents HostNet, a deep learning framework for predicting virus hosts from genomic sequences. HostNet utilizes a Transformer-CNN-BiGRU architecture and two enhanced sequence representation modules to overcome the challenges of data deficiency and imbalance. The results show that HostNet outperforms the state-of-the-art deep learning-based method in host prediction accuracy and F1 score. The improved sequence representation modules significantly enhance HostNet's training generalization, performance in challenging classes, and stability.
Background

The rise in viral outbreaks over the past decade has highlighted the need to determine the hosts of viruses, particularly emerging ones that threaten human and animal health. Yet the traditional means of establishing the host range of a virus, field surveillance and laboratory experiments, is laborious and demanding. A computational tool that reliably predicts host ranges for novel viruses can support timely responses in the prevention and control of emerging infectious diseases. Virus-host prediction is complicated by data imbalance and data deficiency, so developing highly accurate computational tools for predicting virus-host associations is both challenging and pressing.

Results

To overcome the challenges of virus-host prediction, we present HostNet, a deep learning framework that utilizes a Transformer-CNN-BiGRU architecture and two enhanced sequence representation modules. The first module, k-mer to vector, pre-trains a background vector representation of k-mers from a broad range of virus sequences to address the issue of data deficiency. The second module, an adaptive sliding window, truncates virus sequences of various lengths to create a uniform number of informative and distinct samples for each sequence to address the issue of data imbalance. We assess HostNet's performance on a benchmark dataset of Rabies lyssavirus and an in-house dataset of Flavivirus. Our results show that HostNet surpasses the state-of-the-art deep learning-based method in host-prediction accuracy and F1 score. The enhanced sequence representation modules significantly improve HostNet's training generalization, performance in challenging classes, and stability.

Conclusion

HostNet is a promising framework for predicting virus hosts from genomic sequences, addressing challenges posed by sparse and varying-length virus sequence data. Our results demonstrate its potential as a valuable tool for virus-host prediction in various biological contexts. Virus-host prediction based on genomic sequences using deep neural networks is a promising approach to identifying the potential hosts of viruses accurately and efficiently, with significant impacts on public health, disease prevention, and vaccine development.
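
The two sequence representation steps named in the abstract can be illustrated with a short Python sketch. This is not HostNet's implementation: the abstract does not specify how k-mers are tokenized or how the window positions are chosen, so the function names `kmer_tokens` and `adaptive_windows`, the stride-1 tokenization, the evenly spaced window starts, and the `N` padding for short sequences are all assumptions made for illustration. The sketch only shows the general idea of splitting a sequence into k-mers (the tokens a k-mer to vector module would embed) and of adapting the window stride to the sequence length so that every sequence yields the same number of samples.

```python
def kmer_tokens(seq, k=3):
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Assumption: stride-1 tokenization; the source does not state the stride.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def adaptive_windows(seq, window, n_samples):
    """Cut exactly `n_samples` windows of length `window` from `seq`.

    The stride is derived from the sequence length, so long and short
    sequences both produce the same number of samples (hypothetical scheme).
    """
    if window < 1 or n_samples < 1:
        raise ValueError("window and n_samples must be positive")
    if len(seq) <= window:
        # Assumption: pad short sequences with 'N' and repeat the sample.
        return [seq.ljust(window, "N")] * n_samples
    span = len(seq) - window  # largest valid start position
    if n_samples == 1:
        starts = [0]
    else:
        # Evenly spaced starts from 0 to span, using integer arithmetic.
        starts = [(i * span) // (n_samples - 1) for i in range(n_samples)]
    return [seq[s:s + window] for s in starts]


if __name__ == "__main__":
    long_seq = "ACGT" * 25   # 100 nt
    short_seq = "ACGT" * 10  # 40 nt
    for s in (long_seq, short_seq):
        samples = adaptive_windows(s, window=20, n_samples=5)
        print(len(s), len(samples), kmer_tokens(samples[0], k=3)[:4])
```

In this sketch a 100-nucleotide and a 40-nucleotide sequence both yield five windows of 20 nucleotides; the longer sequence simply uses a larger stride. Windows drawn from short sequences overlap heavily, which is one reason a real implementation would need additional care to keep samples distinct.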
