☆ 4.4 Article

Automating Large-Scale Data Quality Verification

PROCEEDINGS OF THE VLDB ENDOWMENT (2018)

期刊

PROCEEDINGS OF THE VLDB ENDOWMENT

卷 11, 期 12, 页码 1781-1794

出版社

ASSOC COMPUTING MACHINERY

DOI: 10.14778/3229863.3229867

关键词

类别

Computer Science, Information Systems Computer Science, Theory & Methods

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

摘要

Modern companies and institutions rely on data to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Therefore, a crucial, but tedious task for everyone involved in data processing is to verify the quality of their data. We present a system for automating the verification of data quality at scale, which meets the requirements of production use cases. Our system provides a declarative API, which combines common quality constraints with userdefined validation code, and thereby enables 'unit tests' for data. We efficiently execute the resulting constraint validation workload by translating it to aggregation queries on Apache Spark. Our platform supports the incremental validation of data quality on growing datasets, and leverages machine learning, e.g., for enhancing constraint suggestions, for estimating the 'predictability' of a column, and for detecting anomalies in historic data quality time series. We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.

Automating Large-Scale Data Quality Verification

期刊

PROCEEDINGS OF THE VLDB ENDOWMENT

出版社

ASSOC COMPUTING MACHINERY

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Automating Large-Scale Data Quality Verification

期刊

PROCEEDINGS OF THE VLDB ENDOWMENT

出版社

ASSOC COMPUTING MACHINERY

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文