
The area under the ROC curve as a measure of clustering quality

DATA MINING AND KNOWLEDGE DISCOVERY (2022)

Journal

DATA MINING AND KNOWLEDGE DISCOVERY

Volume 36, Issue 3, Pages 1219-1245

Publisher

SPRINGER

DOI: 10.1007/s10618-022-00829-0

Keywords

Clustering validation; Area under the curve; Receiver operating characteristics; AUC/ROC; Area under the curve for clustering; Qualitative/visual clustering evaluation

Categories

Computer Science, Artificial Intelligence; Computer Science, Information Systems

Funding

Brazilian research agencies FAPESP [2011/04247-5]
CNPq [302161/2017-1]
Interdisciplinary Center for Clinical Research (IZKF) Faculty of Medicine at the RWTH Aachen

Summary

This paper explores AUC as a performance measure for unsupervised learning, specifically cluster analysis. It treats AUC as an internal/relative measure of clustering quality, called AUCC, and shows that AUCC has an expected value under a null model of random clustering solutions that does not depend on the dataset size or on the number or (im)balance of clusters. It also shows that AUCC is a linear transformation of the Gamma criterion of Baker and Hubert (1975).

Abstract

The area under the receiver operating characteristics (ROC) curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation considered here, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to the effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective.
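The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every function name in it (`pairwise_scores`, `aucc`, `gamma_naive`, `mean_aucc_under_shuffles`) is hypothetical. It assumes that each unordered pair of objects is scored by negative Euclidean distance, that the pair's "class" is whether the two objects share a cluster, and that the null model is a random permutation of the cluster labels. The abstract only states that AUCC is "a linear transformation" of Gamma; the specific form Gamma = 2·AUCC − 1 used in the check below is what the standard definitions give when no two distances are tied, and is stated here as an assumption about the paper's result.

```python
import random
from itertools import combinations
from math import dist


def pairwise_scores(points, labels):
    """Return (similarities, same_cluster flags) over all unordered object pairs."""
    sims, same = [], []
    for i, j in combinations(range(len(points)), 2):
        sims.append(-dist(points[i], points[j]))  # closer pairs score higher
        same.append(labels[i] == labels[j])
    return sims, same


def aucc(points, labels):
    """AUC of 'similarity predicts co-membership', via the rank-sum identity."""
    sims, same = pairwise_scores(points, labels)
    order = sorted(range(len(sims)), key=sims.__getitem__)
    ranks = [0.0] * len(sims)
    k = 0
    while k < len(order):  # give tied similarities their average rank
        m = k
        while m + 1 < len(order) and sims[order[m + 1]] == sims[order[k]]:
            m += 1
        for t in range(k, m + 1):
            ranks[order[t]] = (k + m) / 2 + 1
        k = m + 1
    n_pos = sum(same)
    n_neg = len(same) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both within- and between-cluster pairs")
    rank_sum = sum(r for r, s in zip(ranks, same) if s)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def gamma_naive(points, labels):
    """Gamma by brute force over every (within pair, between pair) combination."""
    sims, same = pairwise_scores(points, labels)
    within = [s for s, f in zip(sims, same) if f]
    between = [s for s, f in zip(sims, same) if not f]
    concordant = sum(w > b for w in within for b in between)
    discordant = sum(w < b for w in within for b in between)
    return (concordant - discordant) / (concordant + discordant)


def mean_aucc_under_shuffles(points, labels, n_shuffles=500, seed=0):
    """Monte Carlo mean of AUCC when the labels are randomly permuted."""
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += aucc(points, shuffled)
    return total / n_shuffles


if __name__ == "__main__":
    # Toy data (made up for illustration): two well-separated groups of three.
    points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.3),
              (5.0, 5.0), (5.2, 6.1), (6.3, 5.4)]
    good = [0, 0, 0, 1, 1, 1]
    bad = [0, 1, 0, 1, 0, 1]
    for name, labels in (("good", good), ("bad", bad)):
        a, g = aucc(points, labels), gamma_naive(points, labels)
        print(name, "AUCC:", a, "Gamma:", g, "2*AUCC-1:", 2 * a - 1)
    print("mean AUCC over shuffles:", mean_aucc_under_shuffles(points, good))
```

In this sketch the brute-force Gamma compares every within-cluster pair against every between-cluster pair, so its cost grows quadratically in the number of object pairs, whereas the rank-based AUCC only sorts the pairs once. That is one way to read the abstract's remark that the equivalence yields a more efficient procedure. Under the label-permutation null assumed here, the shuffled mean should settle near 0.5 regardless of how unbalanced the cluster sizes are.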
