4.7 Article

Clustering approximation via a fusion of multiple random samples

Journal

INFORMATION FUSION
Volume 101

Publisher

ELSEVIER
DOI: 10.1016/j.inffus.2023.101986

Keywords

Clustering approximation; Distributed clustering ensemble; Multiple random samples; Automatic clustering; Ensemble learning


This study proposes a new distributed clustering approximation framework for big data that uses multiple random samples to compute an ensemble result, integrating the component clustering results with two new methods. Experimental results demonstrate that the proposed methods accurately identify the correct number of clusters and offer better scalability, efficiency, and clustering stability.
Big data clustering exploration is often paradoxical: it must proceed with little or no prior domain knowledge. Moreover, clustering a big dataset is a challenging task in a distributed computing framework. To address this, we propose a new distributed clustering approximation framework for big data with quality guarantees. This framework uses multiple disjoint random samples, instead of a single random sample, to compute an ensemble result as an estimate of the true clustering of the entire big dataset. First, we model a large dataset as a collection of random sample data blocks stored in a distributed file system. Then, a subset of data blocks is randomly selected, and the serial clustering algorithm is executed in parallel on the distributed computing framework to generate the component clustering results. In each selected random sample, the number of clusters and the initial centroids are identified using a density peak-based I-niceDP clustering algorithm and then refined by a k-means sweep. Because the random samples are disjoint, traditional consensus functions cannot be used; we therefore propose two new methods, one based on graph similarity and one on a nature-inspired firefly algorithm, to integrate the component clustering results into the final ensemble result. The entire clustering process is supported by systematic measures of clusterability and quality evaluation. The methods are verified in a series of experiments on synthetic and real-world datasets. Our comprehensive experimental results demonstrate that the proposed methods (1) recognize the correct number of clusters by analyzing only a subset of samples and (2) exhibit better scalability, efficiency, and clustering stability.
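The sample-then-fuse pipeline described above can be sketched as follows. This is an illustrative skeleton under simplifying assumptions, not the authors' implementation: the I-niceDP cluster-number detection is replaced by plain k-means with a fixed k, and the graph-similarity / firefly consensus is replaced by a naive second k-means over the pooled component centroids.

```python
import numpy as np

def init_centroids(X, k, rng):
    # Greedy farthest-point initialization keeps the seeds well separated.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = ((X[:, None] - np.array(centroids)[None]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[np.argmax(d)])
    return np.array(centroids, dtype=float)

def kmeans(X, k, iters=20, seed=0):
    # Plain Lloyd's k-means; returns the final centroids.
    rng = np.random.default_rng(seed)
    centroids = init_centroids(X, k, rng)
    for _ in range(iters):
        labels = ((X[:, None] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def sample_and_fuse(X, k, n_blocks=4, seed=0):
    # Cluster disjoint random blocks independently (in the paper this runs in
    # parallel on a distributed framework), then fuse the per-block centroids
    # by clustering them again -- a crude stand-in for the paper's
    # graph-similarity / firefly consensus functions.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    blocks = np.array_split(idx, n_blocks)          # disjoint random samples
    parts = [kmeans(X[b], k, seed=seed + i) for i, b in enumerate(blocks)]
    candidates = np.vstack(parts)                   # n_blocks * k centroids
    return kmeans(candidates, k, seed=seed)         # fused ensemble result

# Toy data: two well-separated Gaussian clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(5, 0.3, (200, 2))])
fused = sample_and_fuse(X, k=2)
```

On this toy data the fused centroids land near the two true cluster means, even though no block ever sees the full dataset, which is the intuition behind estimating the true result from an ensemble of samples.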

