Journal
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume 33, Issue 8, Pages 3075-3089Publisher
IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2019.2962412
Keywords
Clustering; density-based clustering; hierarchical clustering; data mining; relative neighborhood graph
Categories
Funding
- NSERC, Canada
- CNPq, under the program Science without Borders, Brazil
Ask authors/readers for more resources
HDBSCAN*, a cutting-edge hierarchical clustering method based on density, can be challenging to optimize for the parameter mpts. This paper introduces an approach that allows for the efficient computation of multiple HDBSCAN* hierarchies for a range of mpts values, achieving over a hundred hierarchies for a computational cost equivalent to running HDBSCAN* twice.
HDBSCAN*, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter mpts. While a small change in mpts typically leads to a small change in the clustering structure, choosing a good mpts value can be challenging: depending on the data distribution, a high or low mpts value may be more appropriate, and certain clusters may reveal themselves at different values. To explore results for a range of mpts values, one has to run HDBSCAN* for each value independently, which can be computationally impractical. In this paper, we propose an approach to efficiently compute all HDBSCAN* hierarchies for a range of mpts values by building upon results from computational geometry to replace HDBSCAN*'s complete graph with a smaller equivalent graph. An experimental evaluation shows that our approach can obtain over one hundred hierarchies for the computational cost equivalent to running HDBSCAN* about twice, which corresponds to a speedup of more than 60 times, compared to running HDBSCAN* independently that many times. We also propose a series of visualizations that allow users to analyze a collection of hierarchies for a range of mpts values, along with case studies that illustrate how these analyses are performed.
Authors
I am an author on this paper
Click your name to claim this paper and add it to your profile.
Reviews
Recommended
No Data Available