Article

EAFIM: efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data

Journal

KNOWLEDGE AND INFORMATION SYSTEMS
Volume 62, Issue 9, Pages 3565-3583

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s10115-020-01464-1

Keywords

Frequent itemset mining; Apache Spark; Apriori algorithm; Large-scale datasets

Funding

  1. Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India

Frequent itemset mining is a popular tool for discovering knowledge from transactional datasets and also serves as the basis for association rule mining. Several algorithms have been proposed to find frequent patterns, of which the Apriori algorithm is considered the earliest. Apriori has two significant bottlenecks: first, the repeated scanning of the input dataset, and second, the requirement to generate all candidate itemsets before counting their support values. These bottlenecks reduce the effectiveness of Apriori on large-scale datasets, and considerable effort has been made to mitigate them and improve efficiency. In particular, as data sizes grow, even distributed and parallel environments such as MapReduce do not perform well, because the iterative nature of the algorithm incurs high disk overhead. Apache Spark, on the other hand, is gaining significant attention in the field of big data processing because of its in-memory processing capabilities. Apart from utilizing the parallel and distributed computing environment of Spark, the proposed scheme, named efficient apriori-based frequent itemset mining (EAFIM), introduces two novel methods to improve efficiency further. Unlike Apriori, it generates candidates 'on the fly,' i.e., candidate generation and support counting proceed simultaneously while the input dataset is being scanned. Also, instead of using the original input dataset in each iteration, it computes an updated input dataset by removing useless items and transactions; the reduced dataset size in higher iterations enables EAFIM to perform better. Extensive experiments were conducted to analyze the efficiency and scalability of EAFIM, which outperforms other existing methodologies.
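
The two ideas in the abstract, generating and counting candidates in the same scan and shrinking the transaction dataset between iterations, can be illustrated with a short Spark sketch. The Scala code below is a minimal sketch of that iteration pattern, not the authors' EAFIM implementation; the object name EafimSketch, the toy transactions, and the absolute minSupport threshold are illustrative assumptions.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    object EafimSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("eafim-sketch")
          .master("local[*]")          // assumption: local mode, for illustration only
          .getOrCreate()
        val sc = spark.sparkContext

        val minSupport = 2L            // hypothetical absolute support threshold

        // Toy transactional dataset; each transaction is a set of item ids.
        var transactions: RDD[Set[Int]] = sc.parallelize(Seq(
          Set(1, 2, 3), Set(1, 2), Set(2, 3, 4), Set(1, 2, 3, 4), Set(3, 4)
        )).cache()

        // Pass 1: count every single item while scanning the dataset.
        var frequent: RDD[(Set[Int], Long)] = transactions
          .flatMap(t => t.map(item => (Set(item), 1L)))
          .reduceByKey(_ + _)
          .filter { case (_, count) => count >= minSupport }
          .cache()

        var k = 1
        while (!frequent.isEmpty()) {
          frequent.collect().foreach { case (items, count) =>
            println(s"frequent ${items.mkString("{", ",", "}")} support=$count")
          }

          // Shrink the dataset between iterations: keep only items that occur
          // in some frequent k-itemset, and drop transactions that became too
          // short to contain any (k+1)-itemset.
          val keepItems = sc.broadcast(frequent.keys.flatMap(s => s).distinct().collect().toSet)
          val curK = k
          transactions = transactions
            .map(t => t.intersect(keepItems.value))
            .filter(_.size > curK)
            .cache()

          // Candidate generation and support counting happen in the same scan
          // of the reduced dataset: every (k+1)-subset of a transaction whose
          // k-subsets are all frequent contributes one count.
          val frequentK = sc.broadcast(frequent.keys.collect().toSet)
          val nextK = k + 1
          frequent = transactions
            .flatMap(t => t.subsets(nextK).filter(_.subsets(curK).forall(frequentK.value.contains)))
            .map(c => (c, 1L))
            .reduceByKey(_ + _)
            .filter { case (_, count) => count >= minSupport }
            .cache()

          k = nextK
        }

        spark.stop()
      }
    }

Enumerating every (k+1)-subset of a transaction is only practical for short transactions; the point of the sketch is the control flow: each pass generates candidates and counts their support in a single reduceByKey over the scan, and the broadcast set of surviving items prunes the RDD before the next pass.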
