Parallel Hierarchical Subspace Clustering of Categorical Data

IEEE TRANSACTIONS ON COMPUTERS (2019)

Journal

IEEE TRANSACTIONS ON COMPUTERS

Volume 68, Issue 4, Pages 542-555

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TC.2018.2879332

Keywords

Hierarchical subspace-clustering; LSH-based data partitioning; categorical data; Hadoop

Funding

National Natural Science Foundation of China [61876122]
U.S. National Science Foundation [IIS-1618669, CNS-0917137, CCF-0845257]

Abstract

Parallel clustering is an important research area in big data analysis. Conventional Hierarchical Agglomerative Clustering (HAC) techniques are inadequate for large-scale categorical datasets because of two drawbacks. First, HAC consumes excessive CPU time and memory; second, it is non-trivial to decompose clustering tasks into independent sub-tasks that can be executed in parallel. We solve these two problems with a MapReduce-based hierarchical subspace-clustering algorithm, called PAPU, that uses LSH-based data partitioning. PAPU partitions a large-scale dataset into multiple independent sub-datasets, into which similar data objects are mapped. In the local clustering phase, PAPU processes these independent chunks in parallel and obtains sub-clusters corresponding to their respective attribute subspaces. To improve the accuracy of the approximated clustering results, PAPU captures clusters at various scales by applying a hierarchical clustering scheme that iteratively merges sub-clusters during the global clustering phase. We implement PAPU on a 24-node Hadoop computing platform. The experimental results show that hierarchical subspace-clustering coupled with the data-partitioning strategy achieves high clustering efficiency on both synthetic and real-world large-scale datasets. The experiments also demonstrate that PAPU delivers superior extensibility and scalability (e.g., a nearly linear speedup).
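To make the idea of LSH-based partitioning of categorical data more concrete, the following is a minimal sketch, not PAPU's actual scheme, which the abstract does not specify. It assumes a generic MinHash family over (attribute, value) pairs; the function names `minhash_signature` and `lsh_partition` and the parameters `num_hashes` and `num_partitions` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(record, num_hashes=4):
    """MinHash signature of one categorical record (illustrative assumption).

    Each (attribute index, value) pair is treated as one set element, so
    records that share many attribute values tend to share signature entries.
    """
    tokens = [f"{i}={value}" for i, value in enumerate(record)]
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_partition(records, num_hashes=4, num_partitions=8):
    """Assign each record index to a partition derived from its signature.

    Records with identical signatures always land in the same partition,
    so each partition can then be clustered independently.
    """
    partitions = defaultdict(list)
    for idx, record in enumerate(records):
        signature = minhash_signature(record, num_hashes)
        bucket = _hash(-1, repr(signature)) % num_partitions
        partitions[bucket].append(idx)
    return dict(partitions)


if __name__ == "__main__":
    records = [
        ("red", "small", "round"),
        ("red", "small", "round"),
        ("blue", "large", "square"),
    ]
    for bucket, members in sorted(lsh_partition(records).items()):
        print(bucket, members)
```

In a MapReduce setting, a mapper could plausibly emit the bucket as the key and the record as the value, leaving each reducer to cluster one partition locally; that mapping is an assumption of this sketch rather than a description of the paper's implementation.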
