4.7 Article

Clustering approximation via a fusion of multiple random samples

Journal

INFORMATION FUSION
Volume 101, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.inffus.2023.101986

Keywords

Clustering approximation; Distributed clustering ensemble; Multiple random samples; Automatic clustering; Ensemble learning

This study proposes a new distributed clustering approximation framework for big data that computes an ensemble result from multiple random samples and integrates the component clustering results with two new consensus methods. Experimental results demonstrate that the proposed methods identify the correct number of clusters and offer better scalability, efficiency, and clustering stability.
Clustering exploration of big data faces a paradox: clustering requires domain knowledge, yet prior knowledge is often absent or insufficient. Moreover, clustering a big dataset is a challenging task in a distributed computing framework. To address this, we propose a new distributed clustering approximation framework for big data with quality guarantees. This framework uses multiple disjoint random samples, instead of a single random sample, to compute an ensemble result as an estimate of the true clustering of the entire big dataset. First, we model a large dataset as a collection of random sample data blocks stored in a distributed file system. A subset of data blocks is then selected at random, and the serial clustering algorithm is executed in parallel on the distributed computing framework to generate the component clustering results. In each selected random sample, the number of clusters and the initial centroids are identified with the density-peak-based I-niceDP clustering algorithm and then refined by a k-means sweep. Because the random samples are disjoint, traditional consensus functions cannot be used; we therefore propose two new methods, one based on graph similarity and one on a nature-inspired firefly algorithm, to integrate the component clustering results into the final ensemble result. The entire clustering process is supported systematically by extensive measures of clusterability and quality evaluation. The methods are verified in a series of experiments on synthetic and real-world datasets. The comprehensive experimental results demonstrate that the proposed methods (1) recognize the correct number of clusters by analyzing only a subset of samples and (2) exhibit better scalability, efficiency, and clustering stability.
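The sample-then-fuse workflow in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: plain k-means stands in for the I-niceDP component algorithm, clustering the pooled component centroids stands in for the graph-similarity and firefly consensus methods, and the dataset, block count, and function names are all hypothetical.

```python
# Hypothetical sketch of clustering approximation via multiple random samples:
# 1) model the data as disjoint random sample blocks,
# 2) cluster each block independently (the paper runs this step in parallel),
# 3) fuse the component centroids into one ensemble result.
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means; a stand-in for I-niceDP + k-means refinement."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

# Synthetic "big" dataset: two well-separated Gaussian clusters near 0 and 5.
data = np.vstack([rng.normal(0, 0.3, (600, 2)),
                  rng.normal(5, 0.3, (600, 2))])
rng.shuffle(data)

# Disjoint random sample blocks (data blocks in a distributed file system).
blocks = np.array_split(data, 4)

# Component clustering result for each selected block.
component_centers = np.vstack([kmeans(b, k=2) for b in blocks])

# Fusion step: cluster the pooled component centroids to obtain the ensemble
# estimate of the true centroids of the entire dataset.
ensemble_centers = kmeans(component_centers, k=2)
print(np.sort(ensemble_centers[:, 0]))  # two centers, one near 0 and one near 5
```

Because the samples are disjoint, the component results share no data points; fusing them at the centroid level, as above, avoids the point-level co-association matrices that traditional consensus functions rely on.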
