Article

An efficient entropy based dissimilarity measure to cluster categorical data

Journal

Engineering Applications of Artificial Intelligence

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.engappai.2022.105795

Keywords

Distance metric; Dissimilarity metric for categorical data; Entropy based dissimilarity measure; Proximity measure for clustering; Dissimilarity measure for clustering

Abstract

Clustering is an unsupervised learning technique that discovers intrinsic groups based on the proximity between data points. Therefore, the performance of clustering techniques relies mainly on the proximity measures used to compute the (dis)similarity between data objects. In general, it is relatively easy to compute the distance between numerical data points, as numerical operations can be applied directly to the values along each feature. For categorical data sets, however, computing the (dis)similarity between data objects becomes a non-trivial problem. In this paper, we therefore propose a new distance metric based on an information-theoretic approach to compute the dissimilarity between categorical data points. We compute the entropy along each feature to capture the intra-attribute statistical information, based on which the significance of each attribute is decided during clustering. The proposed measure is free from domain-dependent parameters and does not rely on the distribution of the data points. Experiments are conducted over diverse benchmark data sets, considering six competing proximity measures with three popular clustering algorithms, and the clustering results are compared in terms of RI (Rand Index), ARI (Adjusted Rand Index), CA (Clustering Accuracy) and the Cluster Discrimination Matrix (CDM). On over 85 percent of the data sets, the clustering accuracy of the proposed metric embedded with K-Mode and Weighted K-Mode outperforms its counterparts. Approximately 0.2951 s is needed by the proposed metric to cluster a data set with 10,000 data points, 8 attributes and 2 clusters on a standard desktop machine. Overall, the experimental results demonstrate the efficacy of the proposed metric in handling complex real data sets with different characteristics.

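For illustration, the following is a minimal Python sketch of the general idea described in the abstract: per-attribute Shannon entropies are computed from the data and used to weight attribute mismatches between two categorical points. The function names, normalization, and weighting scheme here are assumptions for illustration only and do not reproduce the paper's exact formulation.

from collections import Counter
from math import log2

def attribute_entropies(data):
    # Shannon entropy of each categorical attribute (column) in `data`,
    # where `data` is a list of equal-length records of category labels.
    n_rows, n_cols = len(data), len(data[0])
    entropies = []
    for j in range(n_cols):
        counts = Counter(row[j] for row in data)
        h = -sum((c / n_rows) * log2(c / n_rows) for c in counts.values())
        entropies.append(h)
    return entropies

def entropy_weighted_dissimilarity(x, y, entropies):
    # Dissimilarity between two categorical points: attribute mismatches
    # weighted by the normalized entropy of the corresponding attribute.
    total = sum(entropies) or 1.0
    return sum(h / total for h, a, b in zip(entropies, x, y) if a != b)

# Toy usage with a hypothetical categorical data set.
data = [("red", "small", "yes"),
        ("red", "large", "no"),
        ("blue", "small", "yes"),
        ("green", "large", "no")]
H = attribute_entropies(data)
print(entropy_weighted_dissimilarity(data[0], data[1], H))

A measure of this form can be plugged into K-Mode-style clustering by replacing the simple matching (Hamming) dissimilarity in the mode-assignment step, which is how entropy-based weighting typically influences cluster formation.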