Article

A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 22, Issue 5, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbab005

Keywords

contextualized word embedding; BERT; convolutional neural network; biological sequence; DNA enhancer; NLP transformer

Funding

  1. Research Grant for Newly Hired Faculty, Taipei Medical University [TMU108-AE1-B26]
  2. Higher Education Sprout Project, Ministry of Education, Taiwan [DP2-109-21121-01-A-06]

The study incorporated a BERT-based multilingual model into bioinformatics to represent DNA sequence information, reporting improved sensitivity, specificity, accuracy, and Matthews correlation coefficient for DNA enhancer prediction. Further experiments indicated the potential of deep learning, particularly 2D CNNs, for learning from BERT features in biological modeling.
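The four evaluation measures named above have standard definitions for a binary classifier. The sketch below computes them from confusion-matrix counts. The function name `binary_metrics` and the example counts are illustrative assumptions, not values or code from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC from
    confusion-matrix counts (standard definitions; the name and
    interface are illustrative, not taken from the paper)."""
    total = tp + fp + tn + fn
    # Sensitivity: fraction of true enhancers that were predicted as enhancers.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-enhancers that were predicted as non-enhancers.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient; set to 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts, for illustration only.
scores = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced.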
Recently, language representation models have drawn considerable attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved state-of-the-art performance. BERT adopts the concept of contextualized word embedding to capture the semantics of words together with the context in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We observed that our BERT-based features improved sensitivity, specificity, accuracy and Matthews correlation coefficient by more than 5-10% compared to the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential for learning BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
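A minimal sketch of the "DNA sequence as a sentence, then a fixed-length matrix" idea described in the abstract. Every name and parameter here (`dna_to_sentence`, `placeholder_vector`, `sentence_to_matrix`, `k`, `max_len`, `dim`) is a hypothetical choice for illustration, since the abstract does not specify the tokenisation or the matrix dimensions. The hash-based `placeholder_vector` is only a stand-in so the example runs without a pretrained model. It is static, whereas real BERT embeddings depend on the surrounding tokens.

```python
import hashlib


def dna_to_sentence(seq, k=1):
    """Split a DNA string into overlapping k-mer 'words' separated by
    spaces. The choice of k is an assumption, not from the paper."""
    seq = seq.strip().upper()
    if k < 1 or k > len(seq):
        raise ValueError("k must be between 1 and the sequence length")
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def placeholder_vector(token, dim):
    """Stand-in for an encoder output: a deterministic vector derived
    from a hash of the token. NOT contextual and NOT BERT."""
    digest = hashlib.sha256(token.encode("ascii")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


def sentence_to_matrix(sentence, max_len, dim):
    """Build a fixed-length (max_len x dim) matrix: truncate long
    sentences and zero-pad short ones."""
    tokens = sentence.split()[:max_len]
    rows = [placeholder_vector(tok, dim) for tok in tokens]
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows


sentence = dna_to_sentence("acgtta")          # "A C G T T A"
matrix = sentence_to_matrix(sentence, max_len=8, dim=4)
```

In a real pipeline the placeholder would be replaced by the per-token hidden states of a pretrained BERT model. The resulting `max_len` x `dim` matrix has the 2D layout that a 2D CNN could take as input.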
