☆ 4.7 Review

K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data

INFORMATION SCIENCES (2023)

Journal

INFORMATION SCIENCES

Volume 622, Pages 178-210

Publisher

ELSEVIER SCIENCE INC

DOI: 10.1016/j.ins.2022.11.139

Keywords

K-means; K-means variants; Clustering algorithm; Modified k-means; Improved k-means; Perspectives on big data clustering; Big data clustering

Automated Summary

Advances in data collection techniques have enabled the accumulation of large quantities of data. The K-means algorithm, while popular, has challenges such as determining the number of clusters and detecting non-Euclidean shapes. Research efforts have been made to improve its performance and robustness.

Abstract

Advances in recent techniques for scientific data collection in the era of big data allow for the systematic accumulation of large quantities of data at various data-capturing sites. Similarly, exponential growth in the development of different data analysis approaches has been reported in the literature, amongst which the K-means algorithm remains the most popular and straightforward clustering algorithm. The broad applicability of the algorithm in many clustering application areas can be attributed to its implementation simplicity and low computational complexity. However, the K-means algorithm has many challenges that negatively affect its clustering performance. In the algorithm's initialization process, users must specify the number of clusters in a given dataset a priori, while the initial cluster centers are randomly selected. Furthermore, the algorithm's performance is sensitive to the selection of these initial cluster centers, and for large datasets, determining the optimal number of clusters to start with is a very challenging task. Moreover, because of the algorithm's greedy nature, the random selection of the initial cluster centers sometimes results in convergence to a local minimum. A further limitation is that the similarity between data objects is determined from their features using the Euclidean distance metric, which limits the algorithm's robustness in detecting other cluster shapes and poses a great challenge in detecting overlapping clusters. Many research efforts to improve the K-means algorithm's performance and robustness have been reported in the literature. The current work presents an overview and taxonomy of the K-means clustering algorithm and its variants. The history of K-means, current trends, open issues and challenges, and recommended future research perspectives are also discussed. (c) 2022 Elsevier Inc. All rights reserved.
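The behaviour the abstract describes (a user-supplied number of clusters, randomly chosen initial centers, Euclidean distance as the similarity measure, and greedy convergence that may stop at a local minimum) can be illustrated with a minimal sketch of the standard Lloyd-style K-means iteration. This is not code from the paper. The names `kmeans` and `squared_euclidean`, the parameters `seed` and `max_iter`, and the toy data are all assumptions made for illustration. Rerunning with several seeds and keeping the lowest within-cluster sum of squared errors is one common mitigation for initialization sensitivity, shown here only as an example.

```python
import random


def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style K-means sketch; returns (centers, labels, sse)."""
    rng = random.Random(seed)
    # Random initialization: k distinct data points become the starting centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable; this may be only a local minimum
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared errors, the objective K-means greedily reduces.
    sse = sum(squared_euclidean(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


# Hypothetical toy data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# The number of clusters (k=2) must be supplied up front; restarting from
# several seeds and keeping the lowest-SSE run reduces sensitivity to the start.
best = min((kmeans(data, k=2, seed=s) for s in range(5)), key=lambda r: r[2])
```

Because the distance is Euclidean, this sketch favours compact, roughly spherical clusters, which is the limitation the abstract raises for other cluster shapes and overlapping clusters.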
