☆ 4.4 Article

Human-in-the-loop Data Integration

PROCEEDINGS OF THE VLDB ENDOWMENT (2017)

Journal

PROCEEDINGS OF THE VLDB ENDOWMENT

Volume 10, Issue 12, Pages 2006-2017

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.14778/3137765.3137833

Keywords

Funding

973 Program of China [2015CB358700]
NSF of China [61632016, 61373024, 61602488, 61422205, 61472198]
ARC [DP170102726]
[FDCT/007/2016/AFJ]

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Abstract

Data integration aims to integrate data in different sources and provide users with a unified view. However, data integration cannot be completely addressed by purely automated methods. We propose a hybrid human-machine data integration framework that harnesses human ability to address this problem, and apply it initially to the problem of entity matching. The framework first uses rule-based algorithms to identify possible matching pairs and then utilizes the crowd to refine these candidate pairs in order to compute actual matching pairs. In the first step, we propose similarity-based rules and knowledge-based rules to obtain some candidate matching pairs, and develop effective algorithms to learn these rules based on some given positive and negative examples. We build a distributed in-memory system DIMA to efficiently apply these rules. In the second step, we propose a selection-inference-refine framework that uses the crowd to verify the candidate pairs. We first select some beneficial tasks to ask the crowd and then use transitivity and partial order to infer the answers of unasked tasks based on the crowdsourcing results of the asked tasks. Next we refine the inferred answers with high uncertainty due to the disagreement from the crowd. We develop a crowd-powered database system CDB and deploy it on real crowdsourcing platforms. CDB allows users to utilize a SQL-like language for processing crowd-based queries. Lastly, we provide emerging challenges in human-in-the-loop data integration.

Human-in-the-loop Data Integration

Journal

PROCEEDINGS OF THE VLDB ENDOWMENT

Publisher

ASSOC COMPUTING MACHINERY

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Human-in-the-loop Data Integration

Journal

PROCEEDINGS OF THE VLDB ENDOWMENT

Publisher

ASSOC COMPUTING MACHINERY

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper