Article

A fast parallel attribute reduction algorithm using Apache Spark

Journal

KNOWLEDGE-BASED SYSTEMS
Volume 212

Publisher

ELSEVIER
DOI: 10.1016/j.knosys.2020.106582

Keywords

Rough sets; Big data; Parallel algorithm; Attribute reduction; Apache Spark

This paper proposes a parallel attribute reduction algorithm built on the Apache Spark framework. It improves computing efficiency through a core attribute decision strategy and a batch processing strategy, is further accelerated by three techniques, and shows significant improvements in experiments.
Effective and fast attribute reduction on high-dimensional datasets is one of the most important problems in big data, and several parallel attribute reduction algorithms have been implemented using MapReduce. However, MapReduce is not well suited to iterative computing, which leads to low computational efficiency in many cases. In this paper, we propose a novel parallel attribute reduction algorithm based on the new-generation distributed computing framework Apache Spark. First, a core attribute decision strategy is proposed to replace the traditional attribute significance calculation, reducing the number of iterations from $|C||R| - |R|^2/2 + |R|/2$ to $|C|$, where $|C|$ is the number of condition attributes and $|R|$ is the number of attributes in the reduct result. Furthermore, for high-dimensional datasets, we design a batch processing strategy that reduces the number of iterations exponentially. Second, the proposed algorithm is sped up with three techniques: (1) network data transmission is minimized through localized operations; (2) a single-cache iteration method is used to reduce disk I/O cost; (3) some calculations are skipped by an interruption strategy. In the experimental analysis, we ran the algorithm on various types of real big datasets and random datasets in a real distributed computing environment and compared it with the classic MapReduce-based parallel attribute reduction algorithm PAAR_PR in various aspects. The experimental results show that our algorithm improves computing efficiency by more than 98% compared with PAAR_PR. © 2020 Elsevier B.V. All rights reserved.
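The iteration counts quoted in the abstract can be checked with a small sketch. The expression |C||R| - |R|^2/2 + |R|/2 equals the sum of (|C| - k) for k = 0, ..., |R| - 1, which is what a forward greedy search costs if it evaluates every remaining attribute in each of |R| rounds. The code below is plain Python, not Spark, and it is not the authors' implementation: the function names and the toy decision table are hypothetical, and the core test uses the textbook rough-set definition (an attribute is core if dropping it shrinks the positive region), which may differ in detail from the paper's strategy.

```python
from collections import defaultdict


def greedy_evaluations(num_conditions, reduct_size):
    """Significance evaluations for a forward greedy search:
    round k scores the (num_conditions - k) attributes not yet selected."""
    return sum(num_conditions - k for k in range(reduct_size))


def closed_form(num_conditions, reduct_size):
    """|C||R| - |R|^2/2 + |R|/2, written with integer arithmetic."""
    c, r = num_conditions, reduct_size
    return c * r - (r * r - r) // 2


def positive_region_size(rows, attrs):
    """Count rows whose equivalence class under `attrs` has a single decision."""
    decisions = defaultdict(set)
    counts = defaultdict(int)
    for conditions, decision in rows:
        key = tuple(conditions[i] for i in attrs)
        decisions[key].add(decision)
        counts[key] += 1
    return sum(n for key, n in counts.items() if len(decisions[key]) == 1)


def core_attributes(rows, num_conditions):
    """One pass over the |C| condition attributes (assumed textbook core test)."""
    all_attrs = list(range(num_conditions))
    full = positive_region_size(rows, all_attrs)
    core = []
    for a in all_attrs:
        remaining = [b for b in all_attrs if b != a]
        if positive_region_size(rows, remaining) < full:
            core.append(a)
    return core


# Hypothetical toy decision table: (condition values, decision).
# Attribute 2 duplicates attribute 0, so neither is indispensable alone.
rows = [
    ((0, 0, 0), "n"),
    ((0, 1, 0), "y"),
    ((1, 0, 1), "y"),
    ((1, 1, 1), "y"),
]

if __name__ == "__main__":
    print(greedy_evaluations(100, 10), closed_form(100, 10))
    print(core_attributes(rows, 3))
```

With |C| = 100 and |R| = 10, the greedy count is 955 evaluations, against 100 for a single pass over the condition attributes. On the toy table only attribute 1 is core, because attributes 0 and 2 carry the same information and either one can be dropped without losing consistency.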
