Article

Big data analysis using a parallel ensemble clustering architecture and an unsupervised feature selection approach

Publisher

ELSEVIER
DOI: 10.1016/j.jksuci.2022.11.016

Keywords

Ensemble clustering; Consensus selection; Cluster merit; Parallel clustering architecture

Ensemble clustering, which combines the results of multiple clustering methods, is a challenging research direction in data mining. This study introduces a parallel hierarchical clustering approach that uses a divide-and-conquer strategy to achieve faster and more efficient ensemble clustering. A cluster consensus selection approach is proposed, which selects a subset of primary clusters to participate in the final consensus; the final clusters are then formed from sample-cluster and cluster-cluster similarity on the selected clusters. The proposed scheme also incorporates an unsupervised feature selection approach to remove features that contribute little to clustering. Extensive evaluations on datasets from the UCI machine learning repository show that the proposed scheme outperforms state-of-the-art clustering methods, improving average performance by 6% to 24%.
Ensemble clustering is known as a challenging research direction in data mining. It combines the results of several individual clustering methods to produce higher-quality final clusters. This study introduces a parallel hierarchical clustering approach based on the divide-and-conquer strategy, in an attempt to realize faster and more efficient ensemble clustering. We propose a cluster consensus selection approach that selects a subset of primary clusters, chosen by merit, to participate in the final consensus. Considering the sample-cluster and cluster-cluster similarity on the selected primary clusters, we form the final clusters using the clusters-clustering technique as a consensus function. In addition, the proposed scheme is equipped with an unsupervised feature selection approach to remove features that do not contribute significantly to clustering. Extensive evaluations have been performed on datasets of different dimensions from the University of California Irvine (UCI) machine learning repository. The simulation results demonstrate the efficiency of the proposed scheme, which improves average performance by between 6% and 24% compared to state-of-the-art clustering methods.

© 2022 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
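
To make the idea of a consensus function concrete, the following is a minimal sketch of a generic co-association consensus. It is not the paper's method: the cluster-merit selection, the clusters-clustering consensus function, the parallel architecture, and the feature selection step are not reproduced. The function names `co_association` and `consensus_labels` and the `threshold` parameter are illustrative assumptions. The sketch counts how often each pair of samples shares a cluster across the base partitions, then groups samples whose agreement exceeds the threshold.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_labels(partitions, threshold=0.5):
    """Link samples whose co-association exceeds `threshold` (an assumed
    parameter) and return the connected components as consensus clusters."""
    co = co_association(partitions)
    n = len(co)
    parent = list(range(n))  # union-find forest over samples

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three hypothetical base clusterings of six samples; label names are arbitrary,
# so the first two describe the same partition.
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
print(consensus_labels(base))
```

Because the consensus depends only on whether two samples share a label, not on the label values themselves, base clusterings with different label numbering can be combined directly.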