Article

A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information

BRIEFINGS IN BIOINFORMATICS (2021)

Journal

BRIEFINGS IN BIOINFORMATICS

Volume 22, Issue 5, Pages -

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/bib/bbab005

Keywords

contextualized word embedding; BERT; convolutional neural network; biological sequence; DNA enhancer; NLP transformer

Funding

Research Grant for Newly Hired Faculty, Taipei Medical University [TMU108-AE1-B26]
Higher Education Sprout Project, Ministry of Education, Taiwan [DP2-109-21121-01-A-06]

Automated Summary

The study incorporated a BERT-based multilingual model into bioinformatics to represent DNA sequence information, and reported improved sensitivity, specificity, accuracy, and Matthews correlation coefficient for DNA enhancer prediction. Further experiments indicated that deep learning, in particular a 2D CNN, has potential for learning BERT features in biological modeling.

Abstract

Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved state-of-the-art performance. BERT adopts the concept of contextualized word embedding to capture the semantics of words together with the context in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We observed that our BERT-based features yielded improvements of more than 5-10% in sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential for learning BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
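For readers less familiar with the four evaluation measures named in the abstract, the following is a minimal sketch of how they are conventionally computed from a binary confusion matrix (enhancer vs. non-enhancer). The function name `enhancer_metrics` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the paper and do not reproduce its results.

```python
import math


def enhancer_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification measures from confusion counts.

    tp/fn: enhancers predicted as enhancer / non-enhancer
    tn/fp: non-enhancers predicted as non-enhancer / enhancer
    """
    # Sensitivity (recall): fraction of true enhancers that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-enhancers correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all predictions that were correct.
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient; defined as 0 here when the
    # denominator vanishes (a common convention, assumed in this sketch).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only (not results from the paper).
m = enhancer_metrics(tp=80, fp=30, tn=70, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix and ranges from -1 to 1, which is why it is often reported alongside sensitivity and specificity.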
