Proceedings Paper

Parallel Rule Discovery from Large Datasets by Sampling

PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22) (2022)

Journal

PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22)

Volume -, Issue -, Pages 384-398

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/3514221.3526165

Keywords

Rule discovery; data quality; sampling

Categories

Computer Science, Information Systems; Computer Science, Theory & Methods

Funding

European Research Council (ERC) [652976]
Royal Society Wolfson Research Merit Award [WRM/R1/180014]
NSFC [61902274]
Longhua Science and Technology Innovation Bureau [LHKJCXJCYJ202003]

Summary

This paper proposes a multi-round sampling strategy for rule discovery from large datasets, with accuracy bounds on the precision and recall of the rules mined from samples. To scale with the number of tuple variables, deep Q-learning is used to select semantically relevant predicates; to improve recall, a tableau method recovers constant patterns.

Abstract

Rule discovery from large datasets is often prohibitively costly. The problem becomes even harder when the rules are collectively defined across multiple tables. To scale with large datasets, this paper proposes a multi-round sampling strategy for rule discovery. We consider entity enhancing rules (REEs) for collective entity resolution and conflict resolution, which may carry constant patterns and machine learning predicates. We sample large datasets with accuracy bounds alpha and beta such that at least alpha% of rules discovered from samples are guaranteed to hold on the entire dataset (i.e., precision), and at least beta% of rules on the entire dataset can be mined from the samples (i.e., recall). We also quantify the connection between support and confidence of the rules on samples and their counterparts on the entire dataset. To scale with the number of tuple variables in collective rules, we adopt deep Q-learning to select semantically relevant predicates. To improve the recall, we develop a tableau method to recover constant patterns from the dataset. We parallelize the algorithm such that it is guaranteed to reduce runtime when more processors are used. Using real-life and synthetic data, we empirically verify that the method speeds up REE discovery by 12.2 times with sample ratio 10% and recall 82%.
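The quantities named in the abstract (support and confidence of a rule, and precision and recall of rules mined from a sample) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it assumes single-tuple rules of the form "precondition implies consequence" rather than REEs over multiple tuple variables, it measures support as a fraction of rows rather than a count, and every name in it (`support_confidence`, `precision_recall`, the toy `rows`) is hypothetical.

```python
import random


def support_confidence(rows, precondition, consequence):
    """Return (support, confidence) of the rule: precondition -> consequence.

    Assumption for illustration: support is the fraction of all rows that
    satisfy both sides; confidence is the fraction of rows matching the
    precondition that also satisfy the consequence.
    """
    matched = [r for r in rows if precondition(r)]
    satisfied = [r for r in matched if consequence(r)]
    support = len(satisfied) / len(rows) if rows else 0.0
    confidence = len(satisfied) / len(matched) if matched else 0.0
    return support, confidence


def precision_recall(sample_rules, full_rules):
    """Compare rules mined from a sample against rules of the full dataset.

    precision: share of sample rules that also hold on the full dataset.
    recall: share of full-dataset rules that were found from the sample.
    """
    sample_rules, full_rules = set(sample_rules), set(full_rules)
    common = sample_rules & full_rules
    precision = len(common) / len(sample_rules) if sample_rules else 1.0
    recall = len(common) / len(full_rules) if full_rules else 1.0
    return precision, recall


# Toy data (hypothetical): one Edinburgh row carries a wrong area code.
rows = (
    [{"city": "Edinburgh", "area_code": "0131"}] * 3
    + [{"city": "Edinburgh", "area_code": "020"}]
    + [{"city": "London", "area_code": "020"}] * 4
)


def is_edinburgh(row):
    return row["city"] == "Edinburgh"


def has_0131(row):
    return row["area_code"] == "0131"


# Support and confidence on the entire dataset.
full_support, full_confidence = support_confidence(rows, is_edinburgh, has_0131)

# The same rule evaluated on a 50% random sample (fixed seed for repeatability).
sample = random.Random(0).sample(rows, len(rows) // 2)
sample_support, sample_confidence = support_confidence(sample, is_edinburgh, has_0131)

# Precision and recall for hypothetical rule identifiers.
precision, recall = precision_recall({"r1", "r2"}, {"r1", "r3", "r4", "r5"})

print(full_support, full_confidence)
print(sample_support, sample_confidence)
print(precision, recall)
```

On the sample, support and confidence generally differ from their values on the entire dataset; the abstract states that the paper quantifies this connection, which the sketch above only measures and does not bound.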
