Determining the number of clusters using information entropy for mixed data

PATTERN RECOGNITION (2012)

Journal

PATTERN RECOGNITION

Volume 45, Issue 6, Pages 2251-2265

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.patcog.2011.12.017

Keywords

Clustering; Mixed data; Number of clusters; Information entropy; Cluster validity index; k-Prototypes algorithm

Funding

National Natural Science Foundation of China [71031006, 70971080, 60970014]
Special Prophase Project on National Key Basic Research and Development Program of China (973) [2011CB311805]
Foundation of Doctoral Program Research of Ministry of Education of China [20101401110002]
Key Problems in Science and Technology Project of Shanxi [20110321027-01]

Abstract

In cluster analysis, one of the most challenging problems is determining the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. Many algorithms have been proposed to solve this problem for either numerical or categorical data sets. However, these algorithms are not very effective for a mixed data set containing both numerical and categorical attributes. To overcome this deficiency, this paper presents a generalized mechanism that integrates Rényi entropy and complement entropy. The mechanism can uniformly characterize within-cluster entropy and between-cluster entropy and can identify the worst cluster in a mixed data set. To evaluate clustering results for mixed data, an effective cluster validity index is also defined. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. The performance of the algorithm has been studied on several synthetic and real-world data sets. Comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results. © 2011 Elsevier Ltd. All rights reserved.
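The two entropy measures named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's generalized mechanism, whose exact formulation is not given here. It assumes the common textbook forms: complement entropy of a categorical attribute as the sum of p_i (1 - p_i) over its values, and Rényi quadratic entropy of numerical points estimated with a Gaussian Parzen window. The function names and the `sigma` bandwidth parameter are hypothetical choices made for this illustration.

```python
import math
from collections import Counter


def complement_entropy(values):
    """Complement entropy of one categorical attribute (assumed form):
    sum over distinct values of p_i * (1 - p_i)."""
    n = len(values)
    counts = Counter(values)
    return sum((c / n) * (1 - c / n) for c in counts.values())


def renyi_quadratic_entropy(points, sigma=1.0):
    """Renyi quadratic entropy of numerical points, estimated with a
    Gaussian Parzen window (assumed form):
    H2 = -log( (1/N^2) * sum_i sum_j G(x_i - x_j; 2 * sigma^2) ).
    `sigma` is a hypothetical kernel bandwidth chosen by the caller."""
    n = len(points)
    d = len(points[0])
    var = 2.0 * sigma ** 2  # variance of the convolved Gaussian kernel
    norm = (2.0 * math.pi * var) ** (-d / 2.0)
    total = 0.0
    for x in points:
        for y in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += norm * math.exp(-sq_dist / (2.0 * var))
    return -math.log(total / (n * n))


# A pure categorical column has zero complement entropy; a tight numerical
# cluster has lower Renyi entropy than a widely spread one.
pure = complement_entropy(["a", "a", "a", "a"])
mixed = complement_entropy(["a", "a", "b", "b"])
tight = renyi_quadratic_entropy([(0.0,), (0.1,)])
spread = renyi_quadratic_entropy([(0.0,), (5.0,)])
```

Both quantities shrink as a cluster becomes more homogeneous, which is why entropies of this kind can serve as a common yardstick for the categorical and numerical parts of a mixed data set.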
