DrugFinder: Druggable Protein Identification Model Based on Pre-Trained Models and Evolutionary Information

Journal

ALGORITHMS
Volume 16, Issue 6

Publisher

MDPI
DOI: 10.3390/a16060263

Keywords

druggable protein; transformer-based models; machine learning; feature extraction

Identifying druggable proteins is crucial for drug development. Traditional structure-based methods are time-consuming and expensive, which has driven a shift toward sequence-based methods. In this study, we propose a sequence-based model called DrugFinder, which extracts features from the embedding output of a pre-trained protein model and from the evolutionary information of the position-specific scoring matrix. We used the random forest method to select features and evaluated the selected features with several machine learning classifiers, of which the XGB model achieved the best results. DrugFinder outperformed existing methods, achieving high accuracy, sensitivity, and specificity on the independent test set.
The identification of druggable proteins has always been at the core of drug development. Traditional structure-based identification methods are time-consuming and costly. As a result, a growing number of researchers have turned to sequence-based methods for identifying druggable proteins. We propose a sequence-based druggable protein identification model called DrugFinder. The model extracts features from the embedding output of the pre-trained protein model Prot_T5_Xl_Uniref50 (T5) and from the evolutionary information of the position-specific scoring matrix (PSSM). To remove redundant features and improve model performance, we then used the random forest (RF) method to select features, and the selected features were used to train and test several machine learning classifiers: support vector machines (SVM), RF, naive Bayes (NB), extreme gradient boosting (XGB), and k-nearest neighbors (KNN). Among these classifiers, the XGB model achieved the best results. DrugFinder reached an accuracy of 94.98%, a sensitivity of 96.33%, and a specificity of 96.83% on the independent test set, which is much better than the results of existing identification methods. The model also performed well on an additional tumor-related test set, achieving an accuracy of 88.71% and a precision of 93.72%, which further demonstrates its strong generalization capability.
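
The evaluation metrics named in the abstract (accuracy, sensitivity, specificity, and precision) all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions, treating "druggable" as the positive class. The `ConfusionCounts` class, the function names, and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper or its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts; 'positive' means druggable."""
    tp: int  # druggable proteins predicted as druggable
    fp: int  # non-druggable proteins predicted as druggable
    tn: int  # non-druggable proteins predicted as non-druggable
    fn: int  # druggable proteins predicted as non-druggable


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all proteins classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of druggable proteins that were recovered (recall)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of non-druggable proteins that were correctly rejected."""
    return c.tn / (c.tn + c.fp)


def precision(c: ConfusionCounts) -> float:
    """Fraction of 'druggable' predictions that were correct."""
    return c.tp / (c.tp + c.fp)


# Hypothetical counts for illustration only (not results from the paper).
counts = ConfusionCounts(tp=90, fp=5, tn=95, fn=10)

for name, fn in [("accuracy", accuracy), ("sensitivity", sensitivity),
                 ("specificity", specificity), ("precision", precision)]:
    print(f"{name}: {fn(counts):.2%}")
```

Reporting sensitivity and specificity alongside accuracy is useful when the two classes are unevenly represented, because accuracy alone can hide poor performance on the smaller class.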
