Article

BERT contextual embeddings for taxonomic classification of bacterial DNA sequences

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 208, Issue -, Pages -

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2022.117972

Keywords

DNA; Taxonomic classification; BERT; Contextual embedding; Deep Learning; Convolutional Neural Network

This paper focuses on biological sequence classification and inference, exploring efficient representations of biological sequences. A pretrained BERT model and a CNN are combined into a complete prediction model, and data augmentation is applied to enhance classification accuracy, demonstrating the promise of contextual embeddings for representing biological sequences.
Biological taxonomic classification is an important task for the identification and discovery of organisms, as well as the inference of their evolutionary relationships. The order and structure of a biological sequence's components play a primary role in determining the sequence's identity and function. Efficiently differentiating between bacterial categories therefore requires capturing the interactions and positions of these components within sequences, which is a central challenge in biological sequence classification. Accordingly, considerable recent research has explored efficient representations of biological sequences, such as spectral k-mer representations, one-hot encoding, Hilbert space curves, and classical word embeddings such as Word2Vec. This paper addresses the taxonomic classification of bacterial 16S rRNA genes at five resolutions corresponding to hierarchical taxonomic ranks. A Bidirectional Encoder Representations from Transformers (BERT) model is pretrained on biological sequences, which to the best of our knowledge is the first time BERT has been trained with such sequences. A complete prediction model, BioSeqBERT-CNN, is then proposed: it first extracts contextual embedding representations of DNA sequences using the pretrained BERT model, and these representations are then used for taxonomic classification by a Convolutional Neural Network (CNN). To boost the deep learning classification performance, a data augmentation step is applied. Classification with the original dataset on the most fine-grained rank produced an accuracy of 93.5%, which surpasses that of recent works by 1.5-24.3%. With data augmentation, an accuracy of 99.9% is achieved on the most fine-grained taxonomic rank, exceeding recent works by a minimum of 7.9% and a maximum of 30.7%. These results demonstrate the promise of contextual embeddings, combined with deep learning networks, for representing biological sequences.
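
The record does not include implementation details, but the following minimal Python sketch illustrates the kind of pipeline the abstract describes: DNA sequences are tokenized into overlapping k-mers, a pretrained BERT encoder produces contextual embeddings, and a 1D CNN head classifies the embedded sequence. The checkpoint name, k-mer length, and layer sizes are illustrative assumptions, not the authors' released artifacts.

```python
# Hypothetical sketch of a BERT-embedding + CNN classifier for DNA sequences.
# The checkpoint name, k-mer size, and layer widths are placeholders, not the
# paper's actual BioSeqBERT-CNN configuration.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

K = 3  # k-mer length (assumption)

def to_kmers(seq: str, k: int = K) -> str:
    """Split a DNA sequence into overlapping k-mers separated by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

class BertCnnClassifier(nn.Module):
    """Contextual-embedding extractor (BERT) followed by a 1D CNN head."""
    def __init__(self, bert_name: str, n_classes: int):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.conv = nn.Sequential(
            nn.Conv1d(hidden, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.fc = nn.Linear(256, n_classes)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) contextual embeddings from BERT
        emb = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Conv1d expects (batch, channels, seq_len)
        feats = self.conv(emb.transpose(1, 2)).squeeze(-1)
        return self.fc(feats)

if __name__ == "__main__":
    # "bert-base-uncased" stands in for a DNA-pretrained checkpoint (assumption);
    # n_classes would be the number of taxa at the chosen taxonomic rank.
    name = "bert-base-uncased"
    tok = BertTokenizerFast.from_pretrained(name)
    model = BertCnnClassifier(name, n_classes=5)
    batch = tok([to_kmers("ACGTACGTAGCT")], return_tensors="pt",
                padding=True, truncation=True)
    logits = model(batch["input_ids"], batch["attention_mask"])
    print(logits.shape)  # torch.Size([1, 5])
```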
