Journal
EXPERT SYSTEMS WITH APPLICATIONS
Volume 126, Pages 233-245
Publisher
PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2019.02.030
Keywords
Parallel subspace clustering; Multi-attribute weights; High dimension; Categorical data; MapReduce
Funding
- National Natural Science Foundation of P. R. China [61876122]
- Science and Technological Innovation Team of Shanxi Province [201805D131007]
- U.S. National Science Foundation [CCF-0845257]
Traditional clustering schemes are ill-suited to high-dimensional categorical data for two main reasons. First, they usually represent each cluster using all dimensions without distinction. Second, they rely only on an individual dimension of projection as an attribute's weight, ignoring relevance among attributes. We address both problems with PUMA, a MapReduce-based subspace clustering algorithm that uses multi-attribute weights. PUMA constructs attribute subspaces by calculating an attribute-value weight based on the co-occurrence probability of attribute values among different dimensions. It then obtains the sub-clusters corresponding to the respective attribute subspaces from each computing node in parallel. Lastly, PUMA measures clusters of various scales by applying hierarchical clustering to iteratively merge sub-clusters. We implement PUMA on a 24-node Hadoop cluster. Experimental results show that using multi-attribute weights with subspace clustering can achieve better clustering accuracy on both synthetic and real-world high-dimensional datasets. They also show that PUMA performs well in terms of extensibility and scalability, with nearly linear speedup with respect to the number of nodes. Additionally, the results demonstrate that PUMA is reasonable, effective, and practical for expert-system applications such as knowledge acquisition, word sense disambiguation, automatic abstracting, and recommender systems. © 2019 Elsevier Ltd. All rights reserved.
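The abstract does not give the weighting formula, so the following is only an illustrative sketch of what a co-occurrence-based attribute-value weight could look like, not PUMA's actual definition. It assumes a hypothetical weight for each attribute value: the average, over all other dimensions, of the highest conditional probability with which any single value in that dimension co-occurs with it. The function name `cooccurrence_weights` and the sample records are invented for illustration, and the sketch runs on one machine rather than on MapReduce.

```python
from collections import Counter, defaultdict


def cooccurrence_weights(records):
    """Return {(dim, value): weight} for categorical records.

    Hypothetical weight (an assumption, not the paper's formula):
    for value v in dimension j, average over every other dimension k
    the largest conditional probability P(x_k = u | x_j = v).
    Requires at least two dimensions per record.
    """
    n_dims = len(records[0])
    value_counts = Counter()            # (j, v) -> number of records with x_j = v
    pair_counts = defaultdict(Counter)  # (j, v, k) -> counts of values u seen in dim k

    for row in records:
        for j, v in enumerate(row):
            value_counts[(j, v)] += 1
            for k, u in enumerate(row):
                if k != j:
                    pair_counts[(j, v, k)][u] += 1

    weights = {}
    for (j, v), count in value_counts.items():
        # Strongest co-occurring value in each other dimension, as a probability.
        peaks = [
            max(pair_counts[(j, v, k)].values()) / count
            for k in range(n_dims)
            if k != j
        ]
        weights[(j, v)] = sum(peaks) / len(peaks)
    return weights


# Made-up sample data: (colour, shape, taste)
sample = [
    ("red", "round", "sweet"),
    ("red", "round", "sweet"),
    ("red", "long", "sour"),
    ("green", "long", "sour"),
]
sample_weights = cooccurrence_weights(sample)
```

Under this assumed definition, a value that always appears together with the same values in other dimensions receives a weight of 1, while a value whose companions vary receives a lower weight. In a MapReduce setting the two counting tables would naturally be built by mappers and summed by reducers, but that mapping is likewise an assumption here.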