4.7 Article

Accurate Representation of Protein-Ligand Structural Diversity in the Protein Data Bank (PDB)

Journal

Publisher

MDPI
DOI: 10.3390/ijms21062243

Keywords

protein-ligand complexes; dataset; clustering; structural alignment; refinement

Funding

  1. Ministry of Research (France)
  2. Discngine S.A.S., University Paris Diderot, Sorbonne, Paris Cité (France)
  3. University of La Réunion, Réunion Island, National Institute for Blood Transfusion (INTS, France)
  4. National Institute for Health and Medical Research (INSERM, France)
  5. Labex GR-Ex
  6. program Investissements d'Avenir of the French National Research Agency [ANR-11-LABX-0051, ANR-11-IDEX-0005-02]
  7. Indo-French Centre for the Promotion of Advanced Research/CEFIPRA [5302-2]

The number of protein structures available in the Protein Data Bank (PDB) has increased considerably in recent years. This growth in structures and complexes has enabled numerous large-scale studies in various research areas, e.g., protein-protein interactions, protein-DNA interactions, and drug discovery. Protein redundancy has so far been handled only with a simple protein sequence identity threshold, whereas the similarity of protein-ligand complexes should also be considered from a structural perspective. Protein-ligand duplicates in the PDB are widely known, but they have never been quantitatively assessed because they are complex to analyze and compare. Here, we present a dedicated clustering of protein-ligand structures to avoid the bias found in different studies. The methodology is based on binding site superposition combined with weighted Root Mean Square Deviation (RMSD) assessment and hierarchical clustering. Repeated structures of proteins of interest are highlighted, and only representative conformations are retained for a non-biased view of protein distribution. Three types of cases are described based on the number of distinct conformations identified for each complex. Defining these categories reduces the number of complexes 3.84-fold and offers more refined results than a protein sequence-based method. Widely distinct conformations were analyzed using normalized B-factors. Furthermore, a non-redundant dataset was generated for future molecular interaction analyses or virtual screening studies.
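
The pipeline outlined in the abstract (weighted RMSD between superposed binding sites, followed by hierarchical clustering and selection of representative conformations) can be illustrated with a minimal sketch. The abstract does not state the weighting scheme, the linkage criterion, or the distance cutoff, so everything below is an assumption made for illustration: the per-atom weights, the complete-linkage merging rule, the 1.0 Å cutoff, the medoid-based choice of representative, and the toy coordinates are all hypothetical, and the binding sites are assumed to be already superposed with atoms in matching order.

```python
import math


def weighted_rmsd(coords_a, coords_b, weights):
    """Weighted RMSD between two already-superposed, atom-matched coordinate sets."""
    if not (len(coords_a) == len(coords_b) == len(weights)):
        raise ValueError("coordinate and weight lists must have equal length")
    weighted_sq = sum(
        w * sum((p - q) ** 2 for p, q in zip(a, b))
        for a, b, w in zip(coords_a, coords_b, weights)
    )
    return math.sqrt(weighted_sq / sum(weights))


def complete_linkage_clusters(dist, cutoff):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    (complete linkage) until the closest pair is farther apart than `cutoff`."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cutoff:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def pick_representative(cluster, dist):
    """Return the medoid: the member with the smallest summed distance to the others."""
    return min(cluster, key=lambda m: sum(dist[m][o] for o in cluster))


# Toy data (hypothetical): four binding sites with three matched atoms each.
sites = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0), (0.0, 1.0, 3.0)],
    [(0.0, 0.0, 3.1), (1.0, 0.0, 3.0), (0.0, 1.0, 3.0)],
]
weights = [2.0, 1.0, 1.0]  # assumed weights, e.g. emphasising one atom

# Pairwise weighted-RMSD matrix over all sites.
dist = [[weighted_rmsd(a, b, weights) for b in sites] for a in sites]

clusters = complete_linkage_clusters(dist, cutoff=1.0)  # cutoff is an assumption
representatives = [pick_representative(c, dist) for c in clusters]
print(clusters, representatives)
```

In this toy case, sites 0 and 1 differ by a small shift of one atom, as do sites 2 and 3, while the two pairs are about 3 Å apart, so the four entries collapse into two clusters with one representative each. A real analysis would first need the binding site superposition step, which this sketch leaves out.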
