Journal
BRIEFINGS IN BIOINFORMATICS
Volume 22, Issue 5, Pages -Publisher
OXFORD UNIV PRESS
DOI: 10.1093/bib/bbab005
Keywords
contextualized word embedding; BERT; convolutional neural network; biological sequence; DNA enhancer; NLP transformer
Funding
- Research Grant for Newly Hired Faculty, Taipei Medical University [TMU108-AE1-B26]
- Higher Education Sprout Project, Ministry of Education, Taiwan [DP2-109-21121-01-A-06]
Ask authors/readers for more resources
The study incorporated BERT-based multilingual model in bioinformatics to represent DNA sequence information, showing significant improvement in sensitivity, specificity, accuracy, and Matthews correlation coefficient for DNA enhancer prediction. Advanced experiments revealed the potential of deep learning, particularly through 2D CNN, in learning BERT features for biological modeling.
Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple, yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique by incorporating BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We then observed that our BERT-based features improved more than 5-10% in terms of sensitivity, specificity, accuracy and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential in learning BERT features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
Authors
I am an author on this paper
Click your name to claim this paper and add it to your profile.
Reviews
Recommended
No Data Available