Article

A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability

PATTERN ANALYSIS AND APPLICATIONS (2021)

Journal

PATTERN ANALYSIS AND APPLICATIONS

Volume 24, Issue 3, Pages 1387-1402

Publisher

SPRINGER

DOI: 10.1007/s10044-021-00977-x

Keywords

K-means; PCA; Reciprocal relationship; Step-by-step labeling; Interpretability

Automated Summary

The study explores the relationship between the K-means algorithm and PCA and proposes two new methods, K-P and P-K, that improve clustering results and interpretability by creating sub-datasets and applying step-by-step labeling. Experimental results on a human resource dataset show that the P-K method outperforms the K-P method in interpretability and run time.

Abstract

The K-means algorithm is a popular clustering method that is sensitive to the choice of initial centers and to the selected number of clusters. Its performance is also considerably affected on high-dimensional datasets. Principal component analysis (PCA) is a linear dimensionality reduction method that is closely related to the K-means algorithm. Dimension reduction allows the initial centers to be selected in a smaller space, which is one way to address the initialization problem. The present study investigates the reciprocal relationship between K-means and PCA and adopts an innovative approach of creating sub-datasets and applying step-by-step labeling in the hybrid execution of both algorithms to propose two methods, namely K-P and P-K. The clusters obtained from the two proposed methods are highly interpretable. This was verified by the step-by-step labeling results of a human resource dataset. Interpretability was evaluated via the distribution of features of interest (FoI), suggesting improved results for both datasets. In addition to the qualitative improvement, the study reports improved SSE/N (the sum of squared errors, SSE, divided by the total number of data points, N) and improved silhouette values on 10 datasets, compared with eight initialization methods from previous studies. The P-K method gave better results and a shorter run time than the K-P method.
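The general idea of selecting initial centers in a PCA-reduced space and scoring the result by SSE/N can be illustrated with a minimal sketch. This is not the K-P or P-K method from the paper: sub-datasets and step-by-step labeling are not modeled, and the function names (pca_project, assign, kmeans, pca_guided_kmeans), the default of two components, and the random seeding in the reduced space are all assumptions made for illustration only.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its top principal components (SVD of the centered data)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def assign(X, centers):
    """Return the nearest-center label and squared distance for every point."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)

def kmeans(X, centers, n_iter=100):
    """Plain Lloyd iterations from given initial centers; returns labels, centers, SSE/N."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels, _ = assign(X, centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels, d2 = assign(X, centers)
    return labels, centers, d2.sum() / len(X)

def pca_guided_kmeans(X, k, n_components=2, seed=0):
    """Hypothetical helper: cluster in the reduced space, then use that
    partition to build initial centers for K-means in the full space."""
    rng = np.random.default_rng(seed)
    Z = pca_project(X, n_components)
    # Assumption: seed the low-dimensional run with k distinct random points.
    init_low = Z[rng.choice(len(Z), size=k, replace=False)]
    low_labels, _, _ = kmeans(Z, init_low)
    # Lift the low-dimensional partition back to full-space centers.
    init_full = np.empty((k, X.shape[1]))
    for j in range(k):
        members = X[low_labels == j]
        init_full[j] = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
    return kmeans(X, init_full)
```

Calling pca_guided_kmeans(X, k=3) on an (n, d) array returns the labels, the full-space centers, and SSE/N. Lower SSE/N indicates tighter clusters, but the sketch makes no claim about matching the figures reported in the paper.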
