A fast parallel attribute reduction algorithm using Apache Spark

KNOWLEDGE-BASED SYSTEMS (2021)

Journal

KNOWLEDGE-BASED SYSTEMS

Volume 212, Issue -, Pages -

Publisher

ELSEVIER

DOI: 10.1016/j.knosys.2020.106582

Keywords

Rough sets; Big data; Parallel algorithm; Attribute reduction; Apache Spark

Automated Summary

The paper proposes a novel parallel attribute reduction algorithm built on the Apache Spark framework. It improves computing efficiency through a core attribute decision strategy and a batch processing strategy, speeds up the algorithm with three further techniques, and reports significant improvements in experiments.

Abstract

Fast and effective attribute reduction on high-dimensional datasets is one of the most important issues in big data, and several parallel attribute reduction algorithms have been implemented using MapReduce. However, MapReduce is not well suited to iterative computing, which causes low computational efficiency in many cases. In this paper, we propose a novel parallel attribute reduction algorithm built on the new-generation distributed computing framework Apache Spark. First, a core attribute decision strategy is proposed to replace the traditional attribute significance calculation, reducing the number of iterations from $|C||R| - |R|^2/2 + |R|/2$ to $|C|$, where $|C|$ is the number of condition attributes and $|R|$ is the number of attributes in the reduct. Furthermore, for high-dimensional datasets, we design a batch processing strategy that reduces the number of iterations exponentially. Second, the proposed algorithm is sped up with three techniques: (1) network data transmission is minimized through localized operations; (2) a single-cache iteration method is used to reduce disk I/O cost; (3) some calculations are skipped by an interruption strategy. In the experimental analysis, we ran the algorithm on various types of real big datasets and random datasets in a real distributed computing environment and compared it with the classic MapReduce-based parallel attribute reduction algorithm PAAR_PR in various aspects. The experiments show that the computing efficiency of our algorithm is improved by more than 98% compared with PAAR_PR. © 2020 Elsevier B.V. All rights reserved.
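
The iteration counts quoted in the abstract can be checked with a small sketch. It rests on two assumptions that the abstract does not confirm. First, the larger count is read as the number of attribute significance evaluations in a forward greedy search, where round k scores the |C| - k attributes not yet selected. Second, the |C|-iteration side is illustrated with textbook rough-set backward elimination, which tests each condition attribute once against the positive region; this is a stand-in, not the paper's core attribute decision strategy, and it runs on a single machine rather than on Spark. All function names and the toy table are hypothetical.

```python
from collections import defaultdict


def forward_greedy_evaluations(num_conditions, reduct_size):
    """Significance evaluations of a forward greedy search (assumed reading).

    Round k (k = 0 .. |R|-1) scores each of the |C| - k unselected attributes.
    """
    return sum(num_conditions - k for k in range(reduct_size))


def closed_form(num_conditions, reduct_size):
    """|C||R| - |R|^2/2 + |R|/2, the count quoted in the abstract."""
    c, r = num_conditions, reduct_size
    return c * r - r * r / 2 + r / 2


def positive_region(rows, attrs, decision):
    """Row indices whose equivalence class under `attrs` has one decision value."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    pos = set()
    for members in classes.values():
        if len({rows[i][decision] for i in members}) == 1:
            pos.update(members)
    return pos


def backward_reduct(rows, conditions, decision):
    """Test each condition attribute once; drop it if the positive region is unchanged."""
    target = positive_region(rows, conditions, decision)
    reduct = list(conditions)
    checks = 0
    for a in conditions:
        checks += 1
        trial = [x for x in reduct if x != a]
        if positive_region(rows, trial, decision) == target:
            reduct = trial  # `a` is dispensable given the attributes still kept
    return reduct, checks


# Hypothetical toy decision table: d is (a OR b), and c duplicates a.
rows = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]

print(forward_greedy_evaluations(100, 10), closed_form(100, 10))
print(backward_reduct(rows, ["a", "b", "c"], "d"))
```

With |C| = 100 and |R| = 10, the summation and the closed form agree, and the result is roughly ten times the |C| = 100 checks of a single pass. On the toy table the single pass drops the redundant attribute `a` and keeps `b` and `c` after three checks, one per condition attribute.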
