Review

False discovery rate: the Achilles' heel of proteogenomics

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 23, Issue 5

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbac163

Keywords

false discovery rate; proteogenomics; FDR; gene annotation; variants; novel peptides; NGS; RNA-Seq; ORFs; mass spectrometry; shotgun proteomics

Funding

  1. Indian Council of Medical Research -Senior Research Fellowship [BIC/11/(17)/2015]
  2. Department of Science and Technology, India's DST-INSPIRE Fellowship
  3. Department of Biotechnology, India, Big Data Initiative grant [BT/PR16456/BID/7/624/2016]
  4. Department of Biotechnology, India [GAP0134]
  5. Translational Research Program at THSTI - Department of Biotechnology, India

Proteogenomics integrates genome and proteome analysis to improve genome annotation and yield new biological insights, and its conclusions depend on well-controlled error rates. However, inflated search databases reduce the sensitivity and specificity of proteogenomic studies. Understanding the key factors involved and applying modified search strategies can improve the interpretation of mass spectrometry data and help manage false positives and false negatives.
Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass-spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms, and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de facto method for false discovery rate (FDR) estimation in proteomics. The proteogenomic databases constructed from three- or six-frame translation of nucleotide databases not only increase the search space and compute time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret the proteogenomics results appropriately and control false positives and negatives in a more informed manner. In this review, we first briefly discuss the proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
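
To make the target-decoy and peptide-class-specific FDR ideas in the abstract concrete, here is a minimal Python sketch. It is not the authors' method: the `PSM` record, the function names, the simple decoys/targets estimator, the toy scores and the 25% cutoff are all assumptions chosen for illustration, and real pipelines typically use q-values and far larger result sets.

```python
from collections import namedtuple

# Hypothetical record for one peptide-spectrum match (PSM); field names are assumptions.
PSM = namedtuple("PSM", ["score", "is_decoy", "peptide_class"])


def score_threshold_at_fdr(psms, max_fdr):
    """Return the lowest score cutoff whose estimated FDR (decoys / targets)
    stays at or below max_fdr, or None if no cutoff qualifies."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    best = None
    for i, psm in enumerate(ranked):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate only after the last PSM of a tied score group.
        if i + 1 < len(ranked) and ranked[i + 1].score == psm.score:
            continue
        if targets > 0 and decoys / targets <= max_fdr:
            best = psm.score
    return best


def accepted_targets(psms, max_fdr):
    """Target PSMs passing a single (global) FDR cutoff."""
    cutoff = score_threshold_at_fdr(psms, max_fdr)
    if cutoff is None:
        return []
    return [p for p in psms if not p.is_decoy and p.score >= cutoff]


def class_specific_accepted(psms, max_fdr):
    """Apply the FDR cutoff separately within each peptide class."""
    by_class = {}
    for p in psms:
        by_class.setdefault(p.peptide_class, []).append(p)
    return {cls: accepted_targets(group, max_fdr) for cls, group in by_class.items()}


# Toy data (made up): known peptides score well, novel ones overlap with decoys.
toy_psms = [
    PSM(100, False, "known"), PSM(90, False, "known"),
    PSM(80, False, "known"), PSM(70, False, "known"),
    PSM(10, True, "known"),
    PSM(60, False, "novel"), PSM(55, True, "novel"),
    PSM(50, False, "novel"), PSM(40, True, "novel"),
    PSM(30, True, "novel"),
]

global_hits = accepted_targets(toy_psms, max_fdr=0.25)
class_hits = class_specific_accepted(toy_psms, max_fdr=0.25)
print("novel hits, global FDR:", sum(p.peptide_class == "novel" for p in global_hits))
print("novel hits, class-specific FDR:", len(class_hits["novel"]))
```

On this toy data the global cutoff accepts both novel-class targets, because the many confident known-peptide matches dilute the decoy count, while the class-specific cutoff accepts only one. That is the kind of difference a peptide-class-specific FDR is meant to expose.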
