RSDB: representative protein sequence databases have high information content

Motivation: Biological sequence databases are highly redundant for two main reasons. First, various databanks keep redundant entries, with many identical and nearly identical sequences. Second, natural sequences often have high sequence identities owing to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide the same amount of biological information as the original full database?

Results: Comparisons of nine representative sequence databases (RSDBs) derived from full protein databanks showed that the information content of a sequence database is not linearly proportional to its size. An RSDB reduced to a mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third the size of the full database, which made iterative profile searching six times faster. The RSDBs are produced at different levels of granularity for efficient homology searching.

Availability: All the RSDB files generated and the full analysis results are available over the Internet: ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ and http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB

Contact: jong@biosophy.org
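To make the idea of a database with "mutual sequence identity of 50% or less" concrete, the following is a minimal sketch of one common way to build such a set: a greedy pass that keeps a sequence only if it is no more than a threshold percentage identical to every sequence already kept. This is an illustration under stated assumptions, not the procedure used to build the RSDBs, which the abstract does not describe. The function names `percent_identity` and `representative_set` are hypothetical, and the identity measure is a crude proxy based on Python's `difflib` rather than a proper pairwise alignment.

```python
from difflib import SequenceMatcher


def percent_identity(a: str, b: str) -> float:
    """Crude identity proxy (assumption, not a real alignment score):
    residues in common ordered blocks, divided by the shorter length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / min(len(a), len(b))


def representative_set(sequences: dict, threshold: float = 50.0) -> list:
    """Greedy selection: keep a sequence only if its identity to every
    already-kept sequence is at most `threshold` percent."""
    kept = []
    # Visit longer sequences first (ties broken by name) so that each
    # group of similar sequences is represented by its longest member.
    for name in sorted(sequences, key=lambda n: (-len(sequences[n]), n)):
        seq = sequences[name]
        if all(percent_identity(seq, sequences[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy sequences invented for illustration only.
    toy = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSRE",  # near-identical to "a"
        "c": "GGGGGGGGGGPPPPPPPPPPWW",  # unrelated to "a" and "b"
    }
    print(representative_set(toy, threshold=50.0))
```

The pass is quadratic in the number of kept sequences, so a sketch like this only suits small inputs. Lowering `threshold` yields a smaller, coarser set and raising it yields a larger, finer one, which corresponds to the different levels of granularity mentioned in the Results.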
