Article

Alignment-Free Sequence Comparison (I): Statistics and Power

Journal

JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 16, Issue 12, Pages 1615-1634

Publisher

MARY ANN LIEBERT, INC
DOI: 10.1089/cmb.2009.0198

Keywords

alignment-free; normal approximation; normal distribution; sequence alignment; word count statistics

Funding

  1. EPSRC [GR/R52183/01]
  2. BBSRC
  3. EPSRC
  4. National University of Singapore
  5. NIH [P50 HG 002790, R21AG032743]
  6. NATIONAL HUMAN GENOME RESEARCH INSTITUTE [P50HG002790, R21HG006199] Funding Source: NIH RePORTER
  7. NATIONAL INSTITUTE ON AGING [R21AG032743] Funding Source: NIH RePORTER

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D2 statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D2 statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D2 word count statistic, which we call D2S and D2*. For D2S, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D2*, outperforms D2S in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D2*, we cannot provide a closed form for power calculations.
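To make the k-tuple comparison concrete, the following is a minimal Python sketch of the three word count statistics named in the abstract. The abstract itself gives no formulas, so everything here is an assumption: the plain statistic is taken to be the sum over words w of X_w * Y_w, where X_w and Y_w are the counts of word w in each sequence, and the two variants use counts centered under an i.i.d. letter model with user-supplied letter probabilities. The function names (kmer_counts, d2, d2_variants) and the letter_probs parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count overlapping k-tuples (words of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y, k):
    """Plain word count statistic: sum over words of X_w * Y_w."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Only words present in both sequences contribute a nonzero term.
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())


def d2_variants(x, y, k, letter_probs):
    """Return (D2S, D2*) under an assumed i.i.d. letter model.

    Assumed definitions (not stated in the abstract):
      Xc_w = X_w - n_bar * p_w,  Yc_w = Y_w - m_bar * p_w
      D2S  = sum_w Xc_w * Yc_w / sqrt(Xc_w**2 + Yc_w**2)
      D2*  = sum_w Xc_w * Yc_w / (sqrt(n_bar * m_bar) * p_w)
    where n_bar, m_bar are the numbers of word positions and
    p_w is the product of the letter probabilities of w.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    n_bar, m_bar = len(x) - k + 1, len(y) - k + 1
    d2s = 0.0
    d2star = 0.0
    # Enumerate every possible word, since absent words still
    # contribute through their centered (negative) counts.
    for letters in product(sorted(letter_probs), repeat=k):
        w = "".join(letters)
        p_w = prod(letter_probs[a] for a in letters)
        xc = cx[w] - n_bar * p_w
        yc = cy[w] - m_bar * p_w
        d2star += xc * yc / (sqrt(n_bar * m_bar) * p_w)
        norm = sqrt(xc * xc + yc * yc)
        if norm > 0:  # skip words where both centered counts vanish
            d2s += xc * yc / norm
    return d2s, d2star


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(d2("ACGTACGT", "ACGTTGCA", 2))
print(d2_variants("ACGTACGT", "ACGTTGCA", 2, uniform))
```

Enumerating all 4^k words keeps the sketch short but is only practical for small k; a version for longer words would need a sparser treatment of the words that occur in neither sequence.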
