4.7 Article

Critical evaluation of the use of artificial data for machine learning based de novo peptide identification

Journal

COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL
Volume 21, Issue -, Pages 2732-2743

Publisher

ELSEVIER
DOI: 10.1016/j.csbj.2023.04.014

Keywords

Artificial data; Synthetic data; Peptide sequencing; Noise

Ask authors/readers for more resources

Proteins are crucial for living cells and proteomics, the study of their expression, has diverse applications. Peptide identification is typically done by matching mass spectra to a protein database, but de novo methods can also be used. This study critically analyzes the use of artificial data for training and evaluating de novo peptide identification algorithms.
Proteins are essential components of all living cells and so the study of their in situ expression, proteomics, has wide reaching applications. Peptide identification in proteomics typically relies on matching high resolution tandem mass spectra to a protein database but can also be performed de novo. While artificial spectra have been successfully incorporated into database search pipelines to increase peptide identification rates, little work has been done to investigate the utility of artificial spectra in the context of de novo peptide identification. Here, we perform a critical analysis of the use of artificial data for the training and evaluation of de novo peptide identification algorithms. First, we classify the different fragment ion types present in real spectra and then estimate the number of spurious matches using random peptides. We then categorise the different types of noise present in real spectra. Finally, we transfer this knowledge to artificial data and test the performance of a state-of-the-art de novo peptide identification algorithm trained using artificial spectra with and without relevant noise addition. Noise supplementation increased artificial training data performance from 30% to 77% of real training data peptide recall. While real data performance was not fully replicated, this work provides the first steps towards an artificial spectrum framework for the training and evaluation of de novo peptide identification algorithms. Further enhanced artificial spectra may allow for more in depth analysis of de novo algorithms as well as alleviating the reliance on database searches for training data. (c) 2023 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY license (http://creativecommons.org/licenses/ by/4.0/).

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available