Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides

Journal

BIOINFORMATICS
Volume 38, Issue 5, Pages 1470-1472

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btab838

Funding

  1. Swedish Cancer Society [CAN 2017/685, CAN 2020/1269 PjF]
  2. Erling-Persson Family Foundation [12/12-2017, 22/9-2020]
  3. DART
  4. Rescuer EU-projects
  5. National Natural Science Foundation of China [32100505]
  6. Guangdong Science and Technology Department [2020B1212060018, 2020B1212030004]
  7. German Ministry of Research and Education [BMBF] [031A535A]
  8. Wellcome Trust [208391/Z/17/Z]

The pypgatk package and the pgdb workflow create proteogenomics databases from ENSEMBL resources. The tools generate protein sequences from several types of transcripts and account for the effect of genomic variants on protein sequences. Using these tools, the authors reanalyzed six public PRIDE datasets and identified non-canonical peptides amounting to more than 5% of all peptides identified.
We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. They also support exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tools enable the generation of variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioPortal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling, including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified.
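
The three-frame translation mentioned in the abstract can be illustrated with a minimal, self-contained sketch. This is not the pypgatk API: the function names (translate, three_frame_translation, open_segments) and the min_len cutoff are hypothetical and chosen only for illustration. The sketch assumes the standard genetic code and translates the forward strand only.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2).

    Stop codons are written as '*'; codons with unknown bases become 'X'.
    Trailing bases that do not form a full codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq):
    """Return the translations of the three forward reading frames."""
    return {frame: translate(seq, frame) for frame in range(3)}


def open_segments(protein, min_len=7):
    """Split a translation at stop codons and keep segments of at least min_len residues.

    The min_len default is an illustrative assumption, not a value from the paper.
    """
    return [segment for segment in protein.split("*") if len(segment) >= min_len]


if __name__ == "__main__":
    frames = three_frame_translation("ATGGCCAAGTAA")
    for frame, protein in frames.items():
        print(frame, protein)
```

Each frame's translation would typically be split at stop codons, as open_segments does, before the resulting sequences are added to a search database; how pypgatk itself filters or names these entries is not described in the abstract.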
