4.4 Article

On the String Matching with k Differences in DNA Databases

期刊

PROCEEDINGS OF THE VLDB ENDOWMENT
卷 14, 期 6, 页码 903-915

出版社

ASSOC COMPUTING MACHINERY
DOI: 10.14778/3447689.3447695

关键词

-

向作者/读者索取更多资源

This paper discusses an efficient index mechanism for string matching with k differences, particularly suitable for searching DNA databases. The time complexity of the method is bounded by O(k * |T|), with promising results shown in extensive experiments.
In this paper, we discuss an efficient and effective index mechanism for the string matching with k differences, by which we will find all the substrings of a target string y of length n that align with a pattern string x of length m with not more than k insertions, deletions, and mismatches. A typical application is the searching of a DNA database, where the size of a genome sequence in the database is much larger than that of a pattern. For example, n is often on the order of millions or billions while m is just a hundred or a thousand. The main idea of our method is to transform y to a BWTarray as an index, denoted as BWT(y), and search x against it. The time complexity of our method is bounded by O(k . vertical bar T vertical bar), where T is a tree structure dynamically generated during a search of BWT(y). The average value of vertical bar T vertical bar is bounded by O(vertical bar Sigma vertical bar(2k)), where Sigma is an alphabet from which we take symbols to make up target and pattern strings. This time complexity is better than previous strategies when k <= O(log(vertical bar Sigma vertical bar) n). The general working process consists of two steps. In the first step, xis decomposed into a series of l small subpatterns, and BWT(y) is utilized to speed up the process to figure out all the occurrences of such subpatterns with left perpendicular k/l right perpendicular differences. In the second step, all the found occurrences in the first step will be rechecked to see whether they really match x, but with k differences. Extensive experiments have been conducted, which show that our method for this problem is promising.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

4.4
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据