☆ 4.7 Article

A scalable and flexible basket analysis system for big transaction data in Spark

INFORMATION PROCESSING & MANAGEMENT (2024)

Journal

INFORMATION PROCESSING & MANAGEMENT

Volume 61, Issue 2, Pages -

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.ipm.2023.103577

Keywords

Big transaction data; Frequent itemset mining; Parallel and distributed computing; Business basket analysis; Basket analysis systems

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

This paper proposes a scalable distributed frequent itemset mining (ScaDistFIM) algorithm to address the data scalability and flexibility issues in basket analysis in the big data era. Experiment results demonstrate that the ScaDistFIM algorithm is more efficient compared to the Spark FP-Growth algorithm.

Basket analysis is a prevailing technique to help retailers uncover patterns and associations of sold products in customer shopping transactions. However, as the size of transaction databases grows, the traditional basket analysis techniques and systems become less effective because of two issues in the applications of the big data age: data scalability and flexibility to adapt different application tasks. This paper proposes a scalable distributed frequent itemset mining (ScaDistFIM) algorithm for basket analysis on big transaction data to solve these two problems. ScaDistFIM is performed in two stages. The first stage uses the FP-Growth algorithm to compute the local frequent itemsets from each random subset of the distributed transaction dataset, and all random subsets are computed in parallel. The second stage uses an approximation method to aggregate all local frequent itemsets to the final approximate set of frequent itemsets where the support values of the frequent itemsets are estimated. We further elaborate on implementing the ScaDistFIM algorithm and a flexible basket analysis system using Spark SQL queries to demonstrate the system's flexibility in real applications. The experiment results on synthetic and real-world transaction datasets demonstrate that compared to the Spark FP-Growth algorithm, the ScaDistFIM algorithm can achieve time savings of at least 90% while ensuring nearly 100% accuracy. Hence, the ScaDistFIM algorithm exhibits superior scalability. On dataset GenD with 1 billion records, the ScaDistFIM algorithm requires only 360 s to achieve 100% precision and recall. In contrast, due to memory limitations, Spark FP-Growth cannot complete the computation task.

A scalable and flexible basket analysis system for big transaction data in Spark

Journal

INFORMATION PROCESSING & MANAGEMENT

Publisher

ELSEVIER SCI LTD

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

A scalable and flexible basket analysis system for big transaction data in Spark

Journal

INFORMATION PROCESSING & MANAGEMENT

Publisher

ELSEVIER SCI LTD

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper