4.7 Article

Clustering approximation via a fusion of multiple random samples

Journal

INFORMATION FUSION
Volume 101, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.inffus.2023.101986

Keywords

Clustering approximation; Distributed clustering ensemble; Multiple random samples; Automatic clustering; Ensemble learning

This study proposes a new distributed clustering approximation framework for big data that computes an ensemble result from multiple random samples and integrates the component clustering results with two new consensus methods. Experimental results demonstrate that the proposed methods identify the correct number of clusters and offer better scalability, efficiency, and clustering stability.
Clustering exploration of big data faces a paradox: clustering requires domain knowledge, yet prior knowledge is often absent or insufficient. Moreover, clustering a big dataset is a challenging task in a distributed computing framework. To address this, we propose a new distributed clustering approximation framework for big data with quality guarantees. This framework uses multiple disjoint random samples, instead of a single random sample, to compute an ensemble result as an estimate of the true clustering of the entire big dataset. First, we model a large dataset as a collection of random sample data blocks stored in a distributed file system. A subset of data blocks is then selected at random, and the serial clustering algorithm is executed in parallel on the distributed computing framework to generate the component clustering results. In each selected random sample, the number of clusters and the initial centroids are identified with the density-peak-based I-niceDP clustering algorithm and then refined by a k-means sweep. Because the random samples are disjoint, traditional consensus functions cannot be used; we therefore propose two new methods, one based on graph similarity and one on a nature-inspired firefly algorithm, to integrate the component clustering results into the final ensemble result. The entire clustering process is supported systematically by extensive measures of clusterability and quality evaluation. The methods are verified in a series of experiments on synthetic and real-world datasets. The comprehensive experimental results demonstrate that the proposed methods (1) recognize the correct number of clusters by analyzing only a subset of samples and (2) exhibit better scalability, efficiency, and clustering stability.
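The sample-then-fuse workflow in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: plain k-means stands in for the I-niceDP component algorithm, clustering the pooled component centroids stands in for the graph-similarity and firefly consensus methods, and the dataset, block count, and function names are all hypothetical.

```python
# Hypothetical sketch of clustering approximation via multiple random samples:
# 1) model the data as disjoint random sample blocks,
# 2) cluster each block independently (the paper runs this step in parallel),
# 3) fuse the component centroids into one ensemble result.
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means; a stand-in for I-niceDP + k-means refinement."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

# Synthetic "big" dataset: two well-separated Gaussian clusters near 0 and 5.
data = np.vstack([rng.normal(0, 0.3, (600, 2)),
                  rng.normal(5, 0.3, (600, 2))])
rng.shuffle(data)

# Disjoint random sample blocks (data blocks in a distributed file system).
blocks = np.array_split(data, 4)

# Component clustering result for each selected block.
component_centers = np.vstack([kmeans(b, k=2) for b in blocks])

# Fusion step: cluster the pooled component centroids to obtain the ensemble
# estimate of the true centroids of the entire dataset.
ensemble_centers = kmeans(component_centers, k=2)
print(np.sort(ensemble_centers[:, 0]))  # two centers, one near 0 and one near 5
```

Because the samples are disjoint, the component results share no data points; fusing them at the centroid level, as above, avoids the point-level co-association matrices that traditional consensus functions rely on.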
