
MetaNovo: An open-source pipeline for probabilistic peptide discovery in complex metaproteomic datasets

Journal

PLOS COMPUTATIONAL BIOLOGY
Volume 19, Issue 6

Publisher

PUBLIC LIBRARY OF SCIENCE
DOI: 10.1371/journal.pcbi.1011163


MetaNovo is an open-source software pipeline that integrates existing tools with a custom algorithm to produce targeted protein sequence databases for mass spectrometry metaproteomic analysis as an intermediate filtering step prior to standard sequence database search approaches. The software uses open-source tools to match peptide mass spectrometry spectra to sequence database entries and can be installed in a cluster or run standalone on a Linux machine. It is relevant for analyzing protein data from multiple organisms, where the exact species composition is unknown, and provides an avenue for analysis when accurate taxonomic characterization is not available.
Author summary

MetaNovo is an open-source software pipeline that integrates existing tools with a custom algorithm to produce targeted protein sequence databases for mass spectrometry metaproteomic analysis as an intermediate filtering step prior to standard sequence database search approaches. MetaNovo uses open-source tools to match peptide mass spectrometry spectra to sequence database entries in a parallelised and scalable manner and can be installed in a cluster or run standalone on a Linux machine. The software is scalable to the number of input files and search sequence database size. As inputs the software requires raw mass spectrometry data in MGF format, and a UniProt FASTA sequence database to search. The pipeline is relevant to users analysing protein data from multiple organisms, where the exact species composition is unknown, such as microbiome or environmental samples, and provides an avenue for analysis when matched metagenomics data or accurate taxonomic characterisation is not available as it infers the organisms and proteins present directly from the raw data and the parent sequence database. The targeted sequence database produced can be used with standard downstream peptide identification software that relies on a targeted input sequence database to search the raw data against and allows greater sensitivity in peptide spectral matching in metaproteomic datasets.

Background

Microbiome research is providing important new insights into the metabolic interactions of complex microbial ecosystems involved in fields as diverse as the pathogenesis of human diseases, agriculture and climate change. Poor correlations typically observed between RNA and protein expression datasets make it hard to accurately infer microbial protein synthesis from metagenomic data.
Additionally, mass spectrometry-based metaproteomic analyses typically rely on focused search sequence databases based on prior knowledge for protein identification that may not represent all the proteins present in a set of samples. Metagenomic 16S rRNA sequencing only targets the bacterial component, while whole genome sequencing is at best an indirect measure of expressed proteomes. Here we describe a novel approach, MetaNovo, that combines existing open-source software tools to perform scalable de novo sequence tag matching with a novel algorithm for probabilistic optimization of the entire UniProt knowledgebase to create tailored sequence databases for target-decoy searches directly at the proteome level, enabling metaproteomic analyses without prior expectation of sample composition or metagenomic data generation and compatible with standard downstream analysis pipelines.

Results

We compared MetaNovo to published results from the MetaPro-IQ pipeline on 8 human mucosal-luminal interface samples, obtaining comparable numbers of peptide and protein identifications, many shared peptide sequences and a similar bacterial taxonomic distribution compared to that found using a matched metagenome sequence database, but MetaNovo simultaneously identified many more non-bacterial peptides than the previous approaches. MetaNovo was also benchmarked on samples of known microbial composition against matched metagenomic and whole genomic sequence database workflows, yielding many more MS/MS identifications for the expected taxa, with improved taxonomic representation, while also highlighting previously described genome sequencing quality concerns for one of the organisms, and identifying an experimental sample contaminant without prior expectation.
Conclusions

By estimating taxonomic and peptide level information directly on microbiome samples from tandem mass spectrometry data, MetaNovo enables the simultaneous identification of peptides from all domains of life in metaproteome samples, bypassing the need for curated sequence databases to search. We show that the MetaNovo approach to mass spectrometry metaproteomics is more accurate than current gold standard approaches of tailored or matched genomic sequence database searches, can identify sample contaminants without prior expectation and yields insights into previously unidentified metaproteomic signals, building on the potential for complex mass spectrometry metaproteomic data to speak for itself.
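
The general idea of using sequence tags to reduce a large FASTA database to a targeted one, as described in the abstract, can be illustrated with a toy sketch. This is not MetaNovo's code or its probabilistic ranking algorithm: the function names `parse_fasta` and `filter_database`, the `min_hits` threshold, and the simplification of tags to plain substrings (real de novo tags also carry flanking mass information) are all assumptions made for illustration.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    parts_by_header = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts_by_header[header] = []
        elif header is not None:
            parts_by_header[header].append(line)
    return {h: "".join(parts) for h, parts in parts_by_header.items()}


def filter_database(records, tags, min_hits=1):
    """Keep entries matched by at least `min_hits` distinct tags.

    Hypothetical simplification: a tag "matches" when it occurs as an
    exact substring of the protein sequence.
    """
    unique_tags = set(tags)
    hits = Counter()
    for header, sequence in records.items():
        hits[header] = sum(1 for tag in unique_tags if tag in sequence)
    targeted = {h: s for h, s in records.items() if hits[h] >= min_hits}
    return targeted, hits


# Made-up example entries and tags, for illustration only.
example_fasta = """>sp|P1|A
MKTAYIAKQR
QISFVKSHFS
>sp|P2|B
GGGGGGGG
>sp|P3|C
AAAKQRQISAAA
"""
example_tags = ["KQRQ", "SHFS", "WWW"]

records = parse_fasta(example_fasta)
targeted, hits = filter_database(records, example_tags, min_hits=1)
strict, _ = filter_database(records, example_tags, min_hits=2)
```

Raising `min_hits` trades sensitivity for a smaller database; the published pipeline instead ranks entries probabilistically, which this sketch does not attempt to reproduce.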
