Article

A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability

Journal

PATTERN ANALYSIS AND APPLICATIONS
Volume 24, Issue 3, Pages 1387-1402

Publisher

SPRINGER
DOI: 10.1007/s10044-021-00977-x

Keywords

K-means; PCA; Reciprocal relationship; Step-by-step labeling; Interpretability


The study explores the relationship between the K-means algorithm and PCA, proposing two new methods, K-P and P-K, that improve the interpretability and results of clustering by creating sub-datasets and applying step-by-step labeling. Experimental results on a human resource dataset show that the P-K method outperforms the K-P method in terms of interpretability and run time.
The K-means algorithm is a popular clustering method that is sensitive to how its initial centers are chosen and to the selected number of clusters. Its performance is also considerably affected on high-dimensional datasets. Principal component analysis (PCA) is a linear dimensionality reduction method that is closely related to the K-means algorithm. Dimension reduction allows the initial centers to be selected in a smaller space, which addresses the initialization problem. The present study investigates the reciprocal relationship between K-means and PCA and adopts an innovative approach of creating sub-datasets and applying step-by-step labeling in the hybrid execution of both algorithms to propose two methods, namely K-P and P-K. The clusters obtained from the two proposed methods are highly interpretable, as verified by the step-by-step labeling results on a human resource dataset. Interpretability was evaluated via the distribution of features of interest (FoI), suggesting improved results for both datasets. In addition to the improved qualitative results, the study showed improvements in the sum of squared errors divided by the total number of data points (SSE/N) and in the silhouette score on 10 datasets, compared with eight initialization methods from previous studies. The P-K method gave better results and a shorter run time than the K-P method.
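The general idea of selecting initial centers in a PCA-reduced space and then scoring the clustering by SSE/N can be sketched as follows. This is only a minimal illustration under stated assumptions, not the paper's K-P or P-K procedure: it does not create sub-datasets or perform step-by-step labeling. The seeding rule (splitting the samples into k equal-sized groups along the first principal component) and all function names are hypothetical, and the data are synthetic.

```python
import numpy as np


def pca_project(X, n_components):
    """Project X onto its top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def assign(Z, centers):
    """Return the index of the nearest center for every row of Z."""
    d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def init_centers_from_pc1(Z, k):
    """Hypothetical seeding rule (an assumption, not the paper's):
    sort samples by their first-PC score, split them into k
    equal-sized groups, and use each group's mean as a center."""
    order = np.argsort(Z[:, 0])
    return np.vstack([Z[g].mean(axis=0) for g in np.array_split(order, k)])


def kmeans(Z, centers, n_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(n_iter):
        labels = assign(Z, centers)
        new_centers = np.vstack([
            # Keep the old center if a cluster happens to be empty.
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(Z, centers), centers


def sse_per_sample(Z, labels, centers):
    """Sum of squared errors divided by the number of samples (SSE/N)."""
    return float(((Z - centers[labels]) ** 2).sum() / len(Z))


# Synthetic data: three well-separated groups in 10 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 10)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 10)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 10)),
])

Z = pca_project(X, n_components=2)           # reduce 10 -> 2 dimensions
labels, centers = kmeans(Z, init_centers_from_pc1(Z, k=3))
score = sse_per_sample(Z, labels, centers)   # lower is better
```

Seeding in the reduced space makes the starting centers deterministic for a given dataset, which removes the run-to-run variation of random initialization. Whether it improves SSE/N or the silhouette score on a particular dataset is an empirical question.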

