Human-in-the-loop Data Integration

PROCEEDINGS OF THE VLDB ENDOWMENT (2017)

Journal

PROCEEDINGS OF THE VLDB ENDOWMENT

Volume 10, Issue 12, Pages 2006-2017

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.14778/3137765.3137833

Categories

Computer Science, Information Systems; Computer Science, Theory & Methods

Funding

973 Program of China [2015CB358700]
NSF of China [61632016, 61373024, 61602488, 61422205, 61472198]
ARC [DP170102726]
[FDCT/007/2016/AFJ]

Abstract

Data integration aims to integrate data in different sources and provide users with a unified view. However, data integration cannot be completely addressed by purely automated methods. We propose a hybrid human-machine data integration framework that harnesses human ability to address this problem, and apply it initially to the problem of entity matching. The framework first uses rule-based algorithms to identify possible matching pairs and then utilizes the crowd to refine these candidate pairs in order to compute actual matching pairs. In the first step, we propose similarity-based rules and knowledge-based rules to obtain some candidate matching pairs, and develop effective algorithms to learn these rules based on some given positive and negative examples. We build a distributed in-memory system DIMA to efficiently apply these rules. In the second step, we propose a selection-inference-refine framework that uses the crowd to verify the candidate pairs. We first select some beneficial tasks to ask the crowd and then use transitivity and partial order to infer the answers of unasked tasks based on the crowdsourcing results of the asked tasks. Next we refine the inferred answers with high uncertainty due to the disagreement from the crowd. We develop a crowd-powered database system CDB and deploy it on real crowdsourcing platforms. CDB allows users to utilize a SQL-like language for processing crowd-based queries. Lastly, we provide emerging challenges in human-in-the-loop data integration.
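
The two-step flow described in the abstract can be illustrated with a small sketch. This is not the DIMA or CDB implementation. It assumes a single similarity-based rule (Jaccard similarity over word tokens with a hand-picked threshold, where the paper learns rules from examples) and models only transitivity-based inference, leaving out knowledge-based rules, task selection, partial-order inference, and the refinement step. All names (`jaccard`, `candidate_pairs`, `TransitiveInference`) are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity over lowercase word tokens (an assumed rule)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records, threshold):
    """Step 1 (machine): keep record pairs whose similarity meets the threshold.

    `records` maps a record id to its text. This is a naive all-pairs scan;
    a real system would avoid comparing every pair.
    """
    return [
        (i, j)
        for i, j in combinations(sorted(records), 2)
        if jaccard(records[i], records[j]) >= threshold
    ]


class TransitiveInference:
    """Step 2 (crowd): store crowd answers and infer unasked pairs.

    Matches are merged into clusters with union-find. A pair is inferred as
    a non-match when some asked non-match connects the two clusters.
    """

    def __init__(self):
        self.parent = {}
        self.non_matches = []  # asked pairs that the crowd marked as different

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def record(self, a, b, is_match):
        """Store one crowd answer for the pair (a, b)."""
        if is_match:
            self.parent[self.find(a)] = self.find(b)
        else:
            self.non_matches.append((a, b))

    def infer(self, a, b):
        """Return True/False if the answer follows by transitivity, else None."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        for x, y in self.non_matches:
            if {self.find(x), self.find(y)} == {ra, rb}:
                return False
        return None  # unknown: this pair would still need to be asked


if __name__ == "__main__":
    records = {
        "r1": "apple iphone 7 32gb",
        "r2": "iphone 7 apple 32gb black",
        "r3": "apple iphone 7 black 32gb",
        "r4": "samsung galaxy s8",
    }
    print(candidate_pairs(records, 0.5))

    crowd = TransitiveInference()
    crowd.record("r1", "r2", True)   # crowd says: same entity
    crowd.record("r2", "r3", True)
    crowd.record("r1", "r4", False)  # crowd says: different entities
    print(crowd.infer("r1", "r3"), crowd.infer("r3", "r4"))
```

In this toy run, only two of the three candidate pairs need to be sent to the crowd: once r1 = r2 and r2 = r3 are answered, r1 = r3 follows by transitivity, which is the kind of saving the selection-inference-refine framework aims for.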
