
How to Use K-means for Big Data Clustering?

PATTERN RECOGNITION (2023)

Journal

PATTERN RECOGNITION

Volume 137

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.patcog.2022.109269

Keywords

Big data; Clustering; Minimum sum-of-squares; Divide and conquer algorithm; Decomposition; K-means; K-means++; Global optimization; Unsupervised learning

Abstract

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering that satisfies the properties of a true big data algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional meta-heuristics. This work shows that data decomposition is the basic approach to solve the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution. © 2022 Elsevier Ltd. All rights reserved.
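The abstract does not spell out the algorithm, so the following is only a minimal sketch of what clustering by data decomposition might look like, not the authors' method. It assumes that each small random sample of the data is clustered independently with K-means++ seeding followed by plain K-means (Lloyd) iterations, and that the centroids with the lowest mean squared error on their own sample are kept. The function names (`sample_kmeans`, `kmeans_pp_init`, `lloyd`, `sq_dists`), the sample size, the number of samples, and the selection rule are all hypothetical choices made for illustration.

```python
import numpy as np


def sq_dists(X, C):
    # Squared Euclidean distance from every row of X to every row of C.
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)


def kmeans_pp_init(X, k, rng):
    # K-means++ seeding: first centre uniformly at random, each further
    # centre drawn with probability proportional to its squared distance
    # to the nearest centre chosen so far.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = sq_dists(X, np.array(centers)).min(axis=1)
        total = d2.sum()
        probs = d2 / total if total > 0 else None
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)


def lloyd(X, C, n_iter=50):
    # Plain K-means iterations; an empty cluster keeps its old centroid.
    C = C.copy()
    for _ in range(n_iter):
        labels = sq_dists(X, C).argmin(axis=1)
        new_C = C.copy()
        for j in range(len(C)):
            members = X[labels == j]
            if len(members):
                new_C[j] = members.mean(axis=0)
        if np.allclose(new_C, C):
            break
        C = new_C
    return C


def sample_kmeans(X, k, chunk_size=500, n_chunks=20, seed=0):
    # Hypothetical decomposition scheme (an assumption, not the paper's):
    # cluster several small random samples independently and keep the
    # centroids with the lowest mean squared error on their own sample.
    # The loop iterations are independent, so they could run in parallel.
    rng = np.random.default_rng(seed)
    best_C, best_obj = None, np.inf
    for _ in range(n_chunks):
        idx = rng.choice(len(X), size=min(chunk_size, len(X)), replace=False)
        S = X[idx]
        C = lloyd(S, kmeans_pp_init(S, k, rng))
        obj = sq_dists(S, C).min(axis=1).mean()
        if obj < best_obj:
            best_C, best_obj = C, obj
    return best_C
```

Calling `sample_kmeans(X, k=3)` on an `(n, d)` NumPy array returns a `(3, d)` array of centroids; a single pass of `sq_dists(X, C).argmin(axis=1)` then assigns every point of the full data set to its nearest centroid. Each K-means run touches only `chunk_size` points, which is the sense in which a scheme like this uses less data and time than K-means on the full data set.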
