Effects of sequence features on machine-learned enzyme classification fidelity

BIOCHEMICAL ENGINEERING JOURNAL (2022)

Journal

BIOCHEMICAL ENGINEERING JOURNAL

Volume 187

Publisher

ELSEVIER

DOI: 10.1016/j.bej.2022.108612

Keywords

Enzyme classification; Sequence feature; Deep learning; Machine learning; Hidden Markov model; Benchmarking

Funding

NIGMS MIRA ESI Award [1R35GM138265-01]

Automated Summary

Assigning enzyme commission (EC) numbers from sequence information alone has been explored through various algorithms. Benchmarks show that all tested classifiers perform best on sequences 300-450 amino acids long. Among the classifiers, ECpred was the most consistent across a changing feature space. The work offers insight into optimal design spaces for generating new synthetic enzymes and into the amino acid composition ranges most common in annotated enzymes.

Abstract

Assigning enzyme commission (EC) numbers using sequence information alone has been the subject of recent classification algorithms in which statistics-, homology- and machine-learning-based methods are used. This work benchmarks the performance of a few of these algorithms as a function of sequence features such as chain length and amino acid composition (AAC). This enables determination of optimal classification windows for de novo sequence generation and enzyme design. Parallelization and visualization workflows are developed to observe classifier performance over changing enzyme length, main EC class and AAC. We applied these workflows to the entire SwissProt database to date (n = 565928) using two locally installable classifiers, ECpred and DeepEC, and collected results from two other webserver-based tools, Deepre and BENZ-ws. All the classifiers exhibit peak performance in the range of 300-450 amino acids in length. Classifiers were most accurate at predicting translocases (EC-6) and least accurate in determining hydrolases (EC-3) and oxidoreductases (EC-1). We also identified the AAC ranges that are most common in annotated enzymes and found that all classifiers work best in this common range. Among the four classifiers, ECpred showed the best consistency in changing feature space. These workflows can be used to benchmark new algorithms as they are developed and to find optimum design spaces for the generation of new, synthetic enzymes.
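The kind of feature-binned benchmark described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' workflow: the function names, the 50-residue bin width, the main-class-only scoring, and the toy records are all assumptions made for illustration. It computes the AAC of a sequence and tabulates main-class accuracy per chain-length bin from (sequence, true EC, predicted EC) triples.

```python
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in seq.

    Non-standard symbols (e.g. X) are ignored in the denominator.
    """
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


def main_class(ec_number):
    """Return the first field of an EC number, e.g. '3' for '3.1.1.1'."""
    return ec_number.split(".")[0]


def accuracy_by_length(records, bin_width=50):
    """Main-class accuracy per chain-length bin.

    records: iterable of (sequence, true_ec, predicted_ec); predicted_ec
    may be None when a classifier returns no prediction.
    Returns {bin_start: accuracy}, ordered by bin_start.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for seq, true_ec, pred_ec in records:
        start = (len(seq) // bin_width) * bin_width
        totals[start] += 1
        if pred_ec is not None and main_class(true_ec) == main_class(pred_ec):
            hits[start] += 1
    return {start: hits[start] / totals[start] for start in sorted(totals)}


# Toy records (hypothetical, not taken from SwissProt or any classifier).
toy_records = [
    ("A" * 320, "1.1.1.1", "1.2.3.4"),   # main class matches
    ("A" * 340, "3.1.1.1", "2.1.1.1"),   # main class differs
    ("ACDE" * 20, "2.7.1.1", "2.7.1.1"),  # exact match
]
length_accuracy = accuracy_by_length(toy_records)
```

The same tabulation could be keyed on an AAC bin or on the true main EC class instead of chain length, which would give the other two views mentioned in the abstract.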
