☆ 4.5 Article

A Spark-based Apriori algorithm with reduced shuffle overhead

JOURNAL OF SUPERCOMPUTING (2021)

期刊

JOURNAL OF SUPERCOMPUTING

卷 77, 期 1, 页码 133-151

出版社

SPRINGER

DOI: 10.1007/s11227-020-03253-7

关键词

Apache Spark; Apriori algorithm; Large-scale datasets; Shuffle overhead

类别

Computer Science, Hardware & Architecture Computer Science, Theory & Methods Engineering, Electrical & Electronic

资金

IIT(ISM), Govt. of India, Dhanbad
Department of Computer Science & Engineering, Indian Institute of Technology (ISM), Dhanbad, India

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

智能总结 New
摘要

This paper introduces a Spark-based Apriori algorithm called SARSO, which improves efficiency by reducing shuffle overhead caused by RDD operations. The method restricts the movement of key-value pairs across cluster nodes, reducing necessary communication and synchronization overhead incurred by the Spark shuffle operation.

Mining frequent itemset is considered as a core activity to find association rules from transactional datasets. Among the different well-known approaches to find frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adopt the Apriori algorithm for large-scale datasets. But the bottlenecks associated with Apriori like/such as repeated scans of the input dataset, generation of all the candidate itemsets prior to counting their support value, etc., reduce the effectiveness of Apriori for large-size datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework does not perform well. This is due to the iterative nature of the algorithm that incurs high disk overhead. In each iteration, the input dataset is scanned that resides on disk, causing the high disk I/O. Apache Spark implementations of Apriori show better performance due to in-memory processing capabilities. It makes iterative scanning of datasets faster by keeping it in a memory abstraction called resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution or shuffle operation incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark's parallel and distributed computing environment, and it is in-memory processing capabilities. It improves the efficiency further by reducing the shuffle overhead caused by RDD operations at each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the necessary communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of the SARSO on benchmark datasets and compared with an existing algorithm. Experimental results show that the SARSO has better performance in terms of running time and scalability.

A Spark-based Apriori algorithm with reduced shuffle overhead

期刊

JOURNAL OF SUPERCOMPUTING

出版社

SPRINGER

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

A Spark-based Apriori algorithm with reduced shuffle overhead

期刊

JOURNAL OF SUPERCOMPUTING

出版社

SPRINGER

关键词

类别

资金

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文