Journal
JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 16, Issue 12, Pages 1615-1634
Publisher
MARY ANN LIEBERT, INC
DOI: 10.1089/cmb.2009.0198
Keywords
alignment-free; normal approximation; normal distribution; sequence alignment; word count statistics
Funding
- EPSRC [GR/R52183/01]
- BBSRC
- EPSRC
- National University of Singapore
- NIH [P50 HG 002790, R21AG032743]
- NATIONAL HUMAN GENOME RESEARCH INSTITUTE [P50HG002790, R21HG006199] Funding Source: NIH RePORTER
- NATIONAL INSTITUTE ON AGING [R21AG032743] Funding Source: NIH RePORTER
Abstract
Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D2 statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D2 statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D2 word count statistic, which we call D2S and D2*. For D2S, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D2*, outperforms D2S in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D2*, we cannot provide a closed form for power calculations.
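The abstract does not spell out the statistics, so the following Python sketch is illustrative only. It assumes the usual reading of D2 as the inner product of the two sequences' k-tuple count vectors. The centered variants are written in the form in which they are commonly quoted (counts centered by their expectation under an i.i.d. letter model shared by both sequences); that form, the function names, and the `letter_probs` parameter are assumptions made here, not details taken from this page.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count overlapping k-tuples (words) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """D2: sum over words w of X_w * Y_w, where X_w and Y_w are word counts."""
    x, y = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * y[w] for w, count in x.items())


def d2_variants(seq_a, seq_b, k, letter_probs):
    """Return (D2S, D2*) under an assumed i.i.d. letter model.

    letter_probs maps each letter to its probability (hypothetical
    parameter; in practice it would be estimated from the data).
    """
    x, y = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    n_bar = len(seq_a) - k + 1  # number of word positions in seq_a
    m_bar = len(seq_b) - k + 1  # number of word positions in seq_b
    d2s = 0.0
    d2star = 0.0
    for letters in product(letter_probs, repeat=k):
        w = "".join(letters)
        p_w = prod(letter_probs[c] for c in letters)
        # Center each count by its expected value under the letter model.
        x_c = x[w] - n_bar * p_w
        y_c = y[w] - m_bar * p_w
        # D2S: self-standardized by the centered counts themselves.
        denom = sqrt(x_c * x_c + y_c * y_c)
        if denom > 0:
            d2s += x_c * y_c / denom
        # D2*: standardized by the model-based scale of the counts.
        d2star += x_c * y_c / (sqrt(n_bar * m_bar) * p_w)
    return d2s, d2star


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
shared_words = d2("ACGT", "ACGT", 2)
centered = d2_variants("AAAA", "AAAA", 1, uniform)
```

Centering is what separates the variants from plain D2 in this sketch: a word that is frequent in both sequences merely because its letters are common contributes little once the expected count is subtracted, which is one way to read the abstract's remark about single-sequence noise.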