☆ 4.7 Article

Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2017)

Journal

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Volume 28, Issue 12, Pages 2936-2948

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TNNLS.2016.2608354

Keywords

Categorical sequences; cluster validation; cluster validity index (CVI); data clustering; model selection; robust clustering

Funding

National Natural Science Foundation of China [61175123, 61672157]
Natural Science Foundation of Fujian Province of China [2015J01238]
U.S. National Science Foundation [CNS-1618629]
Guangdong Province Fund [2013B091300019]
Division Of Computer and Network Systems
Direct For Computer & Info Scie & Enginr [1618629] Funding Source: National Science Foundation

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Abstract

Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role for practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Different from previous studies, which mainly focused on attribute-value data, in this paper, we work on the cluster validation problem for categorical sequences. The evaluation of sequences clustering is currently difficult due to the lack of an internal validation criterion defined with regard to the structural features hidden in sequences. To solve this problem, in this paper, a novel cluster validity index (CVI) is proposed as a function of clustering, with the intracluster structural compactness and intercluster structural separation linearly combined to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed, which provides the new measure with high-quality clustering results by the deterministic initialization and the elimination of noise clusters using an information theoretic method. The new clustering algorithm and the CVI are then assembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and the experimental results on some real-world sequence sets from different domains are given to demonstrate the performance of the proposed method.

Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences

Journal

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences

Journal

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper