Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TNNLS.2016.2608354

Keywords

Categorical sequences; cluster validation; cluster validity index (CVI); data clustering; model selection; robust clustering

Funding

  1. National Natural Science Foundation of China [61175123, 61672157]
  2. Natural Science Foundation of Fujian Province of China [2015J01238]
  3. U.S. National Science Foundation [CNS-1618629]
  4. Guangdong Province Fund [2013B091300019]
  5. U.S. National Science Foundation, Directorate for Computer and Information Science and Engineering, Division of Computer and Network Systems [1618629]

Abstract

Cluster validation, the process of evaluating the quality of clustering results, plays an important role in practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Unlike previous studies, which mainly focused on attribute-value data, this paper addresses the cluster validation problem for categorical sequences. Evaluating sequence clusterings is currently difficult because no internal validation criterion is defined with regard to the structural features hidden in sequences. To solve this problem, a novel cluster validity index (CVI) is proposed as a function of the clustering, in which intracluster structural compactness and intercluster structural separation are linearly combined to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed; it supplies the new measure with high-quality clustering results through deterministic initialization and the elimination of noise clusters using an information-theoretic method. The new clustering algorithm and the CVI are then assembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and experimental results on real-world sequence sets from different domains demonstrate the performance of the proposed method.
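
The model selection procedure mentioned in the abstract (cluster for each candidate number of clusters, score each result with a validity index, keep the best) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not the paper's method: bigram frequency profiles stand in for the structural features, a plain k-means loop with farthest-first initialization stands in for the robust clustering algorithm (no noise-cluster elimination is performed), and the index is simply separation minus a hypothetical `weight` times compactness.

```python
from collections import Counter
from itertools import combinations
import math

def bigram_profile(seq):
    """Normalized bigram counts; an assumed stand-in for structural features."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def distance(p, q):
    """Euclidean distance between two sparse profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def centroid(profiles):
    """Component-wise mean of a non-empty list of profiles."""
    keys = set().union(*profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / len(profiles) for k in keys}

def assign(profiles, centers):
    """Index of the nearest center for every profile."""
    return [min(range(len(centers)), key=lambda j: distance(p, centers[j]))
            for p in profiles]

def cluster(profiles, k, iters=20):
    """k-means with deterministic farthest-first initialization (assumed)."""
    centers = [profiles[0]]
    while len(centers) < k:
        centers.append(max(profiles,
                           key=lambda p: min(distance(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(profiles, centers)
        for j in range(k):
            members = [p for p, l in zip(profiles, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return assign(profiles, centers), centers

def validity_index(profiles, labels, centers, weight=1.0):
    """Linear combination: separation minus weight * compactness (higher is better)."""
    compactness = sum(distance(p, centers[l])
                      for p, l in zip(profiles, labels)) / len(profiles)
    separation = min(distance(a, b) for a, b in combinations(centers, 2))
    return separation - weight * compactness

def select_num_clusters(sequences, candidate_ks):
    """Score every candidate k (each k >= 2) and return the best one with all scores."""
    profiles = [bigram_profile(s) for s in sequences]
    scores = {}
    for k in candidate_ks:
        labels, centers = cluster(profiles, k)
        scores[k] = validity_index(profiles, labels, centers)
    return max(scores, key=scores.get), scores

# Toy data: two visibly different groups of categorical sequences.
sequences = ["ABABABAB", "ABABABA", "BABABABA",
             "CDCDCDCD", "CDCDCDC", "DCDCDCDC"]
best_k, scores = select_num_clusters(sequences, range(2, 5))
print(best_k, scores)
```

On this toy input the index peaks at the candidate that keeps the two groups intact, because splitting either group places two centers close together and sharply reduces the separation term.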
