Journal
INFORMATION SYSTEMS
Volume 65, Issue -, Pages 137-157Publisher
PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.is.2016.12.001
Keywords
Meta-blocking; Map/Reduce model; Parallelization
Categories
Funding
- EU H2020 Big Data Europe project [644564]
- EU FP7 SemData project [612551]
Ask authors/readers for more resources
Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. In order to enable entity resolution to scale to large volumes of data, blocking is typically employed: it clusters similar entities into (overlapping) blocks so that it suffices to perform comparisons only within each block. To further increase efficiency, Meta-blocking is being used to clean the overlapping blocks from unnecessary comparisons, increasing precision by orders of magnitude at a small cost in recall. Despite its high time efficiency though, using Meta-blocking in practice to solve entity resolution problem on very large datasets is still challenging: applying it to 7.4 million entities takes (almost) 8 full days on a modern high-end server. In this paper, we introduce scalable algorithms for Meta-blocking, exploiting the MapReduce framework. Specifically, we describe a strategy for parallel execution that explicitly targets the core concept of Meta blocking, the blocking graph. Furthermore, we propose two more advanced strategies, aiming to reduce the overhead of data exchange. The comparison-based strategy creates the blocking graph implicitly, while the entity-based strategy is independent of the blocking graph, employing fewer MapReduce jobs with a more elaborate processing. We also introduce a load balancing algorithm that distributes the computationally intensive workload evenly among the available compute nodes. Our experimental analysis verifies the feasibility and superiority of our advanced strategies, and demonstrates their scalability to very large datasets.
Authors
I am an author on this paper
Click your name to claim this paper and add it to your profile.
Reviews
Recommended
No Data Available