Journal
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS
Volume 26, Issue 4, Pages 1967-2001Publisher
SPRINGER
DOI: 10.1007/s11280-022-01128-w
Keywords
Similar substring matching; Hierarchical filtering; Tree-based filtering framework; Edit distance
Ask authors/readers for more resources
This paper addresses the problem of similar substring matching with edit distance constraints. Existing methods have an issue with balancing the cost of filtering and verification, so a new hierarchical filtering paradigm is proposed to address this issue. The cost is further reduced by eliminating duplicate filters.
Similar substring matching, as an essential operation in applications including read mapping and text retrieval, has attracted significant attention in the research community. In this paper, we study the problem of similar substring matching with edit distance constraints. Existing methods generally utilize a filtering-and-verification framework to solve this problem - a filtering procedure is employed to reduce the searching space before going to a computationally expensive verification step, and the efficiency depends critically on balancing the cost of filtering and verification. The common filtering paradigm is based on the principle of Pigeonhole stating that a matching result must exactly match at least a certain number of substrings from the query, where the substrings act as a filter. However, the polynomial growth of filters caused by enlarging the number of substrings in filters, leading to the cost of filtering and verification is not well-balanced for the existing methods. To this end, we propose a novel filtering paradigm hierarchical filtering, aiming at achieving a fine-grained balance on the cost of filtering and verification. Unlike using a fixed number of substrings in a filter, our method allows the filters contain a different number of substrings that avoids the polynomial growth of filters. The filters are picked in accord with a scoring metric. We devise a tree-based filtering framework for hierarchical filtering. Also, the cost of filtering and verification is further reduced by eliminating the duplication of filters. Extensive experiments have been conducted on four real-world datasets by comparing to state-of-the-art methods Hobbes3, BWA, and BLAST, etc. The results show that our method outperforms the competing methods under a wide range of parameter settings.
Authors
I am an author on this paper
Click your name to claim this paper and add it to your profile.
Reviews
Recommended
No Data Available