Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

SCIENTIFIC REPORTS (2021)

Journal

SCIENTIFIC REPORTS

Volume 11, Issue 1

Publisher

NATURE PORTFOLIO

DOI: 10.1038/s41598-021-83340-8

Funding

Agence Nationale de la Recherche [ANR-15-RHUS-0004: RHU FIGHT-HF]
CPER IT2MP (Contrat Plan Etat Region, Innovations Technologiques, Modelisation & Medecine Personnalisee)
FEDER (Fonds Europeen de Developpement Regional)
RHU-Region Lorraine doctoral fellowship

Abstract

Choosing the most appropriate unsupervised machine-learning method for heterogeneous or mixed data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of ready-to-use tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity, each followed by hierarchical clustering or Partitioning Around Medoids, and K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables, using 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). The clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI than model-based methods in all scenarios. When the clustering methods were applied to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit. The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
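
The abstract scores every method with the Adjusted Rand Index, which compares a clustering against the known simulated partition and corrects for chance agreement. The sketch below is a minimal illustration of the standard pair-counting formula in Python; it is not the authors' code (the benchmark itself used R tools), and the function name `adjusted_rand_index` and the toy label vectors are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two partitions.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two samples: treat the partitions as trivially identical.
        return 1.0

    # Contingency table cells and its row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of sample pairs placed together in both, in truth, in prediction.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in rows.values())
    together_pred = sum(comb(c, 2) for c in cols.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        # Degenerate case (e.g. both partitions are a single cluster).
        return 1.0
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, relabeled
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that splits every true pair
```

An ARI of 1 means the two partitions agree exactly (cluster labels themselves do not matter), values near 0 correspond to chance-level agreement, and negative values indicate agreement worse than chance.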
