Article

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation

Journal

NUCLEIC ACIDS RESEARCH
Volume 49, Issue D1, Pages D1020-D1028

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/nar/gkaa1105

Funding

  1. Intramural Research Program of the National Library of Medicine at the National Institutes of Health/DHHS

The RefSeq project at NCBI contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins. Changes to the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have substantially reduced spurious annotation, and the collection of protein family models (PFMs) that PGAP uses as annotation evidence has been expanded. The PFMs are available to any user through the Protein Family Models Entrez database, and the attributes they carry support multi-genome analyses and connections to the literature. The reference and representative set of RefSeq prokaryotic genomes is regularly recalculated and available for download and BLAST searches.
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
