Journal
BIOINFORMATICS
Volume 29, Issue 21, Pages 2690-2698
Publisher
OXFORD UNIV PRESS
DOI: 10.1093/bioinformatics/btt462
Funding
- Oxford Martin School
- US NIH [R21HG006199]
- NSF [DMS-1043075]
- OCE [1136818]
- National Natural Science Foundation of China [31171262, 11021463]
- National Key Basic Research Project of China [2009CB918503]
- EPSRC [EP/K032402/1] Funding Source: UKRI
- Engineering and Physical Sciences Research Council [EP/K032402/1] Funding Source: researchfish
- Directorate For Geosciences
- Division Of Ocean Sciences [1136818] Funding Source: National Science Foundation
Motivation: Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, $C_\ell^{*}$ and $C_\ell^{S}$, extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, $\overline{C_2^{*}}$, $\overline{C_2^{S}}$ and $\overline{C_2^{geo}}$, averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences.
Results: Our investigation uses both simulated data and cis-regulatory module data, where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that although all of our statistics perform similarly on real data, on simulated data the Shepp-type statistics are in some instances outperformed by star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics.
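As a rough illustration of what "comparison based on k-tuple word content" means, the sketch below counts overlapping k-tuples and computes the plain, uncentred D2 count (the inner product of two k-tuple count vectors), then averages it over all pairs in a set. This is an assumption-laden toy and not the statistics of the paper: the star-type and Shepp-type statistics named in the abstract additionally centre and normalise the counts under a background model, which is not described here. The function names `kmer_counts`, `d2` and `mean_pairwise_d2` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Uncentred D2: inner product of the two k-tuple count vectors.

    Illustrative only; the paper's statistics centre and normalise counts.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Words absent from counts_b contribute 0 (Counter returns 0 for them).
    return sum(n * counts_b[word] for word, n in counts_a.items())


def mean_pairwise_d2(seqs, k):
    """Average d2 over all unordered pairs in seqs (needs at least two)."""
    pairs = list(combinations(seqs, 2))
    return sum(d2(a, b, k) for a, b in pairs) / len(pairs)
```

For example, `d2("ACGT", "ACGA", 2)` counts the two shared 2-tuples `AC` and `CG`, while `mean_pairwise_d2` mirrors, in spirit only, the idea of averaging pairwise comparisons across a set of sequences.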