Article

Identification of Thermophilic Proteins Based on Sequence-Based Bidirectional Representations from Transformer-Embedding Features

Journal

APPLIED SCIENCES-BASEL
Volume 13, Issue 5

Publisher

MDPI
DOI: 10.3390/app13052858

Keywords

thermophilic proteins; BERT; machine learning; imbalanced dataset; deep learning

Thermophilic proteins have the potential to be used as biocatalysts in biotechnology. BertThermo, a model using BERT as an automatic feature extraction tool, achieved high accuracy in identifying thermophilic proteins. It outperformed previous predictive algorithms and demonstrated robustness in various datasets.
Thermophilic proteins have great potential as biocatalysts in biotechnology. Machine learning algorithms are increasingly used to identify such enzymes, reducing or even eliminating the need for experimental studies. While most previously used machine learning methods were based on manually designed features, we developed BertThermo, a model that uses Bidirectional Encoder Representations from Transformers (BERT) as an automatic feature extraction tool. This method combines a variety of machine learning algorithms and feature engineering methods, while relying on single-feature encoding based on the protein sequence alone for model input. BertThermo achieved accuracies of 96.97% in 5-fold cross-validation and 97.51% in independent testing, identifying thermophilic proteins more reliably than any previously described predictive algorithm. Additionally, BertThermo was tested on a balanced dataset, an imbalanced dataset, and a dataset with homologous sequences, and the results show that BertThermo was the most robust compared with state-of-the-art methods. The source code of BertThermo is available.
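
As a rough illustration of the evaluation protocol named in the abstract, the sketch below computes 5-fold cross-validated accuracy over fixed-length feature vectors with imbalanced classes. It is not the BertThermo implementation. The feature vectors are synthetic stand-ins for BERT embeddings, the nearest-centroid classifier is a placeholder for whatever machine learning algorithm is actually used, and every function name here is hypothetical.

```python
import random
from statistics import mean


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(col) for col in zip(*vectors)]


def predict(x, centroids):
    """Return the class whose centroid is nearest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def cross_val_accuracy(features, labels, k=5, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    scores = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # Placeholder classifier (assumption): one centroid per class,
        # fitted on the training folds only.
        centroids = {
            cls: centroid([features[i] for i in train_idx if labels[i] == cls])
            for cls in set(labels[i] for i in train_idx)
        }
        correct = sum(predict(features[i], centroids) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return mean(scores)


# Synthetic, imbalanced toy data (assumption): 30 "thermophilic" (label 1)
# and 70 "non-thermophilic" (label 0) vectors of dimension 8. Real inputs
# would be embeddings produced from protein sequences by a BERT model.
rng = random.Random(42)
features = [[rng.gauss(1.0, 0.3) for _ in range(8)] for _ in range(30)]
features += [[rng.gauss(-1.0, 0.3) for _ in range(8)] for _ in range(70)]
labels = [1] * 30 + [0] * 70

accuracy = cross_val_accuracy(features, labels, k=5)
print(f"5-fold cross-validated accuracy: {accuracy:.4f}")
```

Stratifying the folds keeps the class ratio of each held-out fold close to that of the full set, which matters when the dataset is imbalanced. Independent testing would additionally score a model fitted on all training data against a separate held-out set.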
