☆ 4.5 Article

Hierarchical filtering: improving similar substring matching under edit distance

WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS (2023)

Journal

WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS

Volume 26, Issue 4, Pages 1967-2001

Publisher

SPRINGER

DOI: 10.1007/s11280-022-01128-w

Keywords

Similar substring matching; Hierarchical filtering; Tree-based filtering framework; Edit distance

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

This paper addresses the problem of similar substring matching with edit distance constraints. Existing methods have an issue with balancing the cost of filtering and verification, so a new hierarchical filtering paradigm is proposed to address this issue. The cost is further reduced by eliminating duplicate filters.

Similar substring matching, as an essential operation in applications including read mapping and text retrieval, has attracted significant attention in the research community. In this paper, we study the problem of similar substring matching with edit distance constraints. Existing methods generally utilize a filtering-and-verification framework to solve this problem - a filtering procedure is employed to reduce the searching space before going to a computationally expensive verification step, and the efficiency depends critically on balancing the cost of filtering and verification. The common filtering paradigm is based on the principle of Pigeonhole stating that a matching result must exactly match at least a certain number of substrings from the query, where the substrings act as a filter. However, the polynomial growth of filters caused by enlarging the number of substrings in filters, leading to the cost of filtering and verification is not well-balanced for the existing methods. To this end, we propose a novel filtering paradigm hierarchical filtering, aiming at achieving a fine-grained balance on the cost of filtering and verification. Unlike using a fixed number of substrings in a filter, our method allows the filters contain a different number of substrings that avoids the polynomial growth of filters. The filters are picked in accord with a scoring metric. We devise a tree-based filtering framework for hierarchical filtering. Also, the cost of filtering and verification is further reduced by eliminating the duplication of filters. Extensive experiments have been conducted on four real-world datasets by comparing to state-of-the-art methods Hobbes3, BWA, and BLAST, etc. The results show that our method outperforms the competing methods under a wide range of parameter settings.

Hierarchical filtering: improving similar substring matching under edit distance

Journal

WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS

Publisher

SPRINGER

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Hierarchical filtering: improving similar substring matching under edit distance

Journal

WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS

Publisher

SPRINGER

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper