Journal
BIOINFORMATICS
Volume 21, Issue 21, Pages 4046-4053
Publisher
OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/bti657
Funding
- NIAID NIH HHS [N01 AI 95360] Funding Source: Medline
Motivation: Short sequence patterns frequently define regions of biological interest (binding sites, immune epitopes, primers, etc.), yet a large fraction of this information exists only within the scientific literature and is thus difficult to locate via conventional means (e.g. keyword queries or manual searches). We describe herein a system to accurately identify and classify sequence patterns from within large corpora using an n-gram Markov model (MM).
Results: As expected, on test sets we found that identification of sequences with limited alphabets and/or regular structures such as nucleic acids (non-ambiguous) and peptide abbreviations (3-letter) was highly accurate, whereas classification of symbolic (1-letter) peptide strings with more complex alphabets was more problematic. The MM was used to analyze two very large, sequence-containing corpora: over 7.75 million Medline abstracts and 9000 full-text articles from Journal of Virology. Performance was benchmarked by comparing the results with Journal of Virology entries in two existing manually curated databases: VirOligo and the HLA Ligand Database. Performance estimates were 98 ± 2% precision/84% recall for primer identification and classification and 67 ± 6% precision/85% recall for peptide epitopes. We also find a dramatic difference between the amounts of sequence-related data reported in abstracts versus full text. Our results suggest that automated extraction and classification of sequence elements is a promising, low-cost means of sequence database curation and annotation.
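To make the idea of an n-gram Markov model classifier concrete, the following is a minimal sketch, not the authors' implementation. It assumes a character-level bigram model per class, add-one smoothing, and a tiny hand-made training set; the class labels (`dna`, `text`), the `alphabet_size` value, and all function and variable names are illustrative assumptions that do not come from the paper.

```python
import math
from collections import defaultdict


class NGramMarkovModel:
    """Character-level n-gram Markov model with add-one smoothing (illustrative)."""

    def __init__(self, n=2, alphabet_size=27):
        self.n = n
        # Assumed smoothing denominator: 26 letters plus the start symbol.
        self.alphabet_size = alphabet_size
        self.context_counts = defaultdict(int)
        self.ngram_counts = defaultdict(int)

    def _padded(self, token):
        # Prefix with start symbols so the first character has a context.
        return "^" * (self.n - 1) + token.upper()

    def train(self, tokens):
        for token in tokens:
            s = self._padded(token)
            for i in range(self.n - 1, len(s)):
                context = s[i - self.n + 1:i]
                self.context_counts[context] += 1
                self.ngram_counts[context + s[i]] += 1

    def log_likelihood(self, token):
        """Average per-character log-probability of the token under this model."""
        s = self._padded(token)
        total = 0.0
        for i in range(self.n - 1, len(s)):
            context = s[i - self.n + 1:i]
            num = self.ngram_counts.get(context + s[i], 0) + 1
            den = self.context_counts.get(context, 0) + self.alphabet_size
            total += math.log(num / den)
        return total / max(len(token), 1)


def classify(token, models):
    """Return the label whose model assigns the token the highest likelihood."""
    return max(models, key=lambda label: models[label].log_likelihood(token))


# Toy training data (hypothetical; a real system would train on large corpora).
training = {
    "dna": ["ATGCGTACGTTAGC", "GGATCCGAATTC", "TTGACAGCTAGCTCAGTCC", "CACGTGACCTGA"],
    "text": ["sequence", "binding", "primer", "results",
             "analysis", "the", "protein", "identified"],
}

models = {}
for label, tokens in training.items():
    model = NGramMarkovModel(n=2)
    model.train(tokens)
    models[label] = model

print(classify("GATTACA", models))  # prints "dna"
print(classify("binder", models))   # prints "text"
```

With only two classes and a four-letter alphabet on one side, the separation is easy, which mirrors the abstract's observation that limited alphabets classify well. Distinguishing 1-letter peptide strings from ordinary words would need a third model and far more training data, since their alphabets overlap heavily.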