Article

An efficient entropy based dissimilarity measure to cluster categorical data

Journal

Engineering Applications of Artificial Intelligence

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.engappai.2022.105795

Keywords

Distance metric; Dissimilarity metric for categorical data; Entropy based dissimilarity measure; Proximity measure for clustering; Dissimilarity measure for clustering

Abstract

Clustering is an unsupervised learning technique that discovers intrinsic groups based on the proximity between data points. Therefore, the performance of clustering techniques relies mainly on the proximity measures used to compute the (dis)similarity between data objects. In general, it is relatively easy to compute the distance between numerical data points, as numerical operations can be applied directly to the values along each feature. For categorical data sets, however, computing the (dis)similarity between data objects becomes a non-trivial problem. In this paper, we therefore propose a new distance metric based on an information-theoretic approach to compute the dissimilarity between categorical data points. We compute the entropy along each feature to capture the intra-attribute statistical information, based on which the significance of each attribute is decided during clustering. The proposed measure is free from domain-dependent parameters and does not rely on the distribution of the data points. Experiments are conducted over diverse benchmark data sets, considering six competing proximity measures with three popular clustering algorithms, and the clustering results are compared in terms of RI (Rand Index), ARI (Adjusted Rand Index), CA (Clustering Accuracy) and the Cluster Discrimination Matrix (CDM). On over 85 percent of the data sets, the clustering accuracy of the proposed metric embedded with K-Mode and Weighted K-Mode outperforms its counterparts. Approximately 0.2951 s is needed by the proposed metric to cluster a data set with 10,000 data points, 8 attributes and 2 clusters on a standard desktop machine. Overall, the experimental results demonstrate the efficacy of the proposed metric in handling complex real data sets with different characteristics.

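For illustration, the following is a minimal Python sketch of the general idea described in the abstract: per-attribute Shannon entropies are computed from the data and used to weight attribute mismatches between two categorical points. The function names, normalization, and weighting scheme here are assumptions for illustration only and do not reproduce the paper's exact formulation.

from collections import Counter
from math import log2

def attribute_entropies(data):
    # Shannon entropy of each categorical attribute (column) in `data`,
    # where `data` is a list of equal-length records of category labels.
    n_rows, n_cols = len(data), len(data[0])
    entropies = []
    for j in range(n_cols):
        counts = Counter(row[j] for row in data)
        h = -sum((c / n_rows) * log2(c / n_rows) for c in counts.values())
        entropies.append(h)
    return entropies

def entropy_weighted_dissimilarity(x, y, entropies):
    # Dissimilarity between two categorical points: attribute mismatches
    # weighted by the normalized entropy of the corresponding attribute.
    total = sum(entropies) or 1.0
    return sum(h / total for h, a, b in zip(entropies, x, y) if a != b)

# Toy usage with a hypothetical categorical data set.
data = [("red", "small", "yes"),
        ("red", "large", "no"),
        ("blue", "small", "yes"),
        ("green", "large", "no")]
H = attribute_entropies(data)
print(entropy_weighted_dissimilarity(data[0], data[1], H))

A measure of this form can be plugged into K-Mode-style clustering by replacing the simple matching (Hamming) dissimilarity in the mode-assignment step, which is how entropy-based weighting typically influences cluster formation.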