Article

A data mining approach to discover unusual folding regions in genome sequences

Journal

KNOWLEDGE-BASED SYSTEMS
Volume 15, Issue 4, Pages 243-250

Publisher

ELSEVIER SCIENCE BV
DOI: 10.1016/S0950-7051(01)00146-0

Keywords

data mining; statistical model; RNA/DNA folding; UFR

Numerous experiments and analyses of RNA structures have revealed that distinct local structure correlates closely with biological function. In this study, we present a data mining approach to discover such unusual folding regions (UFRs) in genome sequences. Our approach is a three-step procedure. In the first step, the degree to which a local structure in a genomic sequence differs from a random folding is evaluated by two z-scores of the local segment, the significance score (SIGSCR) and the stability score (STBSCR). The two scores are computed by sliding a fixed-size window along the sequence, one base at a time, from the start to the end position. In the second step, based on the theory of the non-central Student's t distribution, we derive a linearly transformed non-central Student's t distribution (LTNSTD) to describe the distribution of the SIGSCR and STBSCR values computed along the sequence. In the third step, we extract the significant UFRs, namely those segments whose SIGSCR and/or STBSCR is greater or less than a given threshold calculated from the derived LTNSTD. The approach is successfully applied to the complete genome of Mycoplasma genitalium (M. gen) and discovers these statistical extremes in the genome. Comparison with the two scores computed from randomly shuffled sequences of the entire M. gen genome demonstrates that the UFRs in the M. gen sequence are not selected by chance. These UFRs may imply an important structural role encoded in their sequence information. © 2002 Elsevier Science B.V. All rights reserved.
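
The sliding-window scoring and threshold-based extraction described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: SIGSCR is taken to be the z-score of a window's energy against shuffled copies of the same window, and STBSCR the z-score against all windows of the sequence. The function `toy_energy` and the `PAIR_ENERGY` table are hypothetical stand-ins for a real folding free-energy calculation, and the thresholds are passed in by hand rather than derived from the LTNSTD, which is not reproduced here.

```python
import random
import statistics

# Hypothetical pairing energies for a toy hairpin model.
# This is NOT a real thermodynamic folding model.
PAIR_ENERGY = {
    ("G", "C"): -3.0, ("C", "G"): -3.0,
    ("A", "U"): -2.0, ("U", "A"): -2.0,
    ("G", "U"): -1.0, ("U", "G"): -1.0,
}


def toy_energy(segment):
    """Toy energy: pair base i with base n-1-i, as in a single hairpin."""
    seg = segment.upper().replace("T", "U")
    n = len(seg)
    return sum(PAIR_ENERGY.get((seg[i], seg[n - 1 - i]), 0.0)
               for i in range(n // 2))


def z_score(value, sample):
    """Standard z-score of value against sample; 0.0 if sample has no spread."""
    sd = statistics.pstdev(sample)
    if sd == 0:
        return 0.0
    return (value - statistics.mean(sample)) / sd


def shuffled_energies(segment, n_shuffles, rng):
    """Energies of randomly shuffled copies of one segment."""
    energies = []
    for _ in range(n_shuffles):
        chars = list(segment)
        rng.shuffle(chars)
        energies.append(toy_energy("".join(chars)))
    return energies


def window_scores(sequence, window, n_shuffles=100, seed=0):
    """Slide a fixed window one base at a time; return (start, sigscr, stbscr)."""
    rng = random.Random(seed)
    starts = range(len(sequence) - window + 1)
    energies = [toy_energy(sequence[s:s + window]) for s in starts]
    rows = []
    for s, energy in zip(starts, energies):
        segment = sequence[s:s + window]
        # Assumed SIGSCR: window energy vs. shuffled versions of the window.
        sigscr = z_score(energy, shuffled_energies(segment, n_shuffles, rng))
        # Assumed STBSCR: window energy vs. all windows in the sequence.
        stbscr = z_score(energy, energies)
        rows.append((s, sigscr, stbscr))
    return rows


def extract_extremes(rows, low, high):
    """Keep windows where either score falls below low or above high.

    The thresholds are supplied by the caller; the paper derives them from
    the fitted LTNSTD, which this sketch does not implement.
    """
    return [r for r in rows if any(v < low or v > high for v in r[1:])]


if __name__ == "__main__":
    seq = "AUAUAUAUGGGGAAAACCCCAUAUAUAU"
    for start, sigscr, stbscr in extract_extremes(
            window_scores(seq, window=12), low=-2.0, high=2.0):
        print(start, round(sigscr, 2), round(stbscr, 2))
```

In a real analysis, `toy_energy` would be replaced by the free energy reported by a thermodynamic folding program, and the cut-offs would come from the fitted score distribution rather than fixed values.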
