☆ 4.7 Article

Hierarchical Density-Based Clustering Using MapReduce

IEEE TRANSACTIONS ON BIG DATA (2021)

Journal

IEEE TRANSACTIONS ON BIG DATA

Volume 7, Issue 1, Pages 102-114

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TBDATA.2019.2907624

Keywords

Clustering algorithms; Partitioning algorithms; Programming; Data models; Machine learning algorithms; Big Data; Computational modeling; Density-based hierarchical clustering; MapReduce; big data

Funding

NESRC
CAPES
CNPq (PQ Grant) [304137/2013-8]
CNPq (PVE CSF Grant) [400772/2014-0]
FAPESP
FAPEMIG

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

Hierarchical density-based clustering is a powerful tool for data analysis, but its applicability to large datasets is limited by computational complexity. MapReduce is a programming model used to speed up data mining and machine learning algorithms, however, parallelizing hierarchical clustering algorithms with MapReduce is inherently difficult.

Hierarchical density-based clustering is a powerful tool for exploratory data analysis, which can play an important role in the understanding and organization of datasets. However, its applicability to large datasets is limited because the computational complexity of hierarchical clustering methods has a quadratic lower bound in the number of objects to be clustered. MapReduce is a popular programming model to speed up data mining and machine learning algorithms operating on large, possibly distributed datasets. In the literature, there have been attempts to parallelize algorithms such as Single-Linkage, which in principle can also be extended to the broader scope of hierarchical density-based clustering, but hierarchical clustering algorithms are inherently difficult to parallelize with MapReduce. In this paper, we discuss why adapting previous approaches to parallelize Single-Linkage clustering using MapReduce leads to very inefficient solutions when one wants to compute density-based clustering hierarchies. Preliminarily, we discuss one such solution, which is based on an exact, yet very computationally demanding, random blocks parallelization scheme. To be able to efficiently apply hierarchical density-based clustering to large datasets using MapReduce, we then propose a different parallelization scheme that computes an approximate clustering hierarchy based on a much faster, recursive sampling approach. This approach is based on HDBSCAN*, the state-of-the-art hierarchical density-based clustering algorithm, combined with a data summarization technique called data bubbles. The proposed method is evaluated in terms of both runtime and quality of the approximation on a number of datasets, showing its effectiveness and scalability.

Hierarchical Density-Based Clustering Using MapReduce

Journal

IEEE TRANSACTIONS ON BIG DATA

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Hierarchical Density-Based Clustering Using MapReduce

Journal

IEEE TRANSACTIONS ON BIG DATA

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper