4.4 Article

On the String Matching with k Differences in DNA Databases

Journal

PROCEEDINGS OF THE VLDB ENDOWMENT
Volume 14, Issue 6, Pages 903-915

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.14778/3447689.3447695

Keywords

-

Ask authors/readers for more resources

This paper discusses an efficient index mechanism for string matching with k differences, particularly suitable for searching DNA databases. The time complexity of the method is bounded by O(k * |T|), with promising results shown in extensive experiments.
In this paper, we discuss an efficient and effective index mechanism for the string matching with k differences, by which we will find all the substrings of a target string y of length n that align with a pattern string x of length m with not more than k insertions, deletions, and mismatches. A typical application is the searching of a DNA database, where the size of a genome sequence in the database is much larger than that of a pattern. For example, n is often on the order of millions or billions while m is just a hundred or a thousand. The main idea of our method is to transform y to a BWTarray as an index, denoted as BWT(y), and search x against it. The time complexity of our method is bounded by O(k . vertical bar T vertical bar), where T is a tree structure dynamically generated during a search of BWT(y). The average value of vertical bar T vertical bar is bounded by O(vertical bar Sigma vertical bar(2k)), where Sigma is an alphabet from which we take symbols to make up target and pattern strings. This time complexity is better than previous strategies when k <= O(log(vertical bar Sigma vertical bar) n). The general working process consists of two steps. In the first step, xis decomposed into a series of l small subpatterns, and BWT(y) is utilized to speed up the process to figure out all the occurrences of such subpatterns with left perpendicular k/l right perpendicular differences. In the second step, all the found occurrences in the first step will be rechecked to see whether they really match x, but with k differences. Extensive experiments have been conducted, which show that our method for this problem is promising.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.4
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available