Article

DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

PLOS COMPUTATIONAL BIOLOGY (2022)

Journal

PLOS COMPUTATIONAL BIOLOGY

Volume 18, Issue 10

Publisher

PUBLIC LIBRARY SCIENCE

DOI: 10.1371/journal.pcbi.1010610

Categories

Biochemical Research Methods; Mathematical & Computational Biology

Summary

Automatic clustering with the Density Peak Clustering algorithm was applied to the UniRef50 protein database and identified approximately 45,000 protein clusters. The resulting classification corresponds closely to the Pfam and ECOD resources, and more than 14,000 clusters consist of protein regions with no Pfam annotation.

Abstract

Proteins that are known only at the sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for as yet unannotated sequences. Existing domain family resources typically rely on at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. Here we describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database of approximately 23M sequences. We radically re-implemented a pipeline we had previously developed so that it can handle millions of sequences and data volumes of the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds approximately 45,000 protein clusters in UniRef50. Our automatic classification corresponds closely to those of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters consisting of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
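
For readers unfamiliar with the clustering method named in the abstract, the following is a minimal sketch of generic Density Peak Clustering on a small precomputed distance matrix. It is an illustration only and not the DPCfam pipeline, which works on sequence alignments at a far larger scale. The function name `density_peak_clustering`, the density cutoff `d_c`, the fixed `n_clusters`, and the choice of ranking candidate centers by density times separation are all assumptions made for this example.

```python
import numpy as np

def density_peak_clustering(dist, d_c, n_clusters):
    """Toy Density Peak Clustering on a symmetric distance matrix.

    dist       : (n, n) array of pairwise distances (hypothetical input)
    d_c        : cutoff distance used to count neighbours (assumed parameter)
    n_clusters : number of centers to keep (assumed; real pipelines
                 usually choose centers with other criteria)
    """
    n = dist.shape[0]
    # Local density rho: number of other points closer than d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Visit points from highest to lowest density (ties broken by index).
    order = np.argsort(-rho, kind="stable")

    delta = np.zeros(n)                 # distance to nearest denser point
    nearest_higher = np.full(n, -1)     # index of that denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour; by convention
            # its delta is its largest distance to any point.
            delta[i] = dist[i].max()
            continue
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j

    # Centers combine high density with large separation.
    gamma = rho * delta
    # Always keep the densest point as a center so every other point
    # can trace a chain of denser neighbours back to a labelled center.
    gamma[order[0]] = np.inf
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]

    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    # Each remaining point inherits the label of its nearest denser
    # neighbour, which has already been labelled at this stage.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels

# Usage on two well-separated toy groups of 2D points.
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
                   [10, 10], [10, 11], [11, 10], [11, 11], [10.5, 10.5]],
                  dtype=float)
dist = np.linalg.norm(points[:, None] - points[None], axis=-1)
labels = density_peak_clustering(dist, d_c=1.2, n_clusters=2)
```

In this toy setting the distance is Euclidean. A protein-sequence application would instead need a distance derived from alignments, which this sketch does not attempt to model.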
