Article

Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering

Journal

KNOWLEDGE AND INFORMATION SYSTEMS
Volume 64, Issue 3, Pages 723-742

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s10115-022-01658-9

Keywords

Text clustering; Data clustering; Applied machine learning; Data mining

The task of text clustering is to divide a set of text documents into meaningful groups based on their similarity. Content similarity between documents is commonly used to form clusters, but it may not be effective for large, high-dimensional corpora. This paper proposes a similarity measure for use with spectral clustering that scores a pair of documents by their content similarity and by the similarity each of them has with their shared neighbours. Experimental results on well-known text collections show that the method outperforms existing text clustering techniques in terms of normalized mutual information, f-measure, and v-measure.
The task of text clustering is to partition a set of text documents into meaningful groups such that the documents in a cluster are more similar to each other than to the documents of other clusters, according to a similarity or dissimilarity measure. The choice of similarity measure is therefore crucial for producing good-quality clusters. The content similarity between two documents, measured from the terms they share, is generally used to form individual clusters. However, content similarity alone may not be effective for a reasonably large and high-dimensional corpus. A similarity measure is therefore proposed here to improve the performance of text clustering with the spectral method. The proposed measure assigns a score to a pair of documents based on their content similarity and on the similarity of each document with the neighbours they share over the corpus. Its effectiveness has been tested by clustering different standard corpora with the spectral clustering method. The empirical results on some well-known text collections show that the proposed method performs better than state-of-the-art text clustering techniques in terms of normalized mutual information, f-measure and v-measure.
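
The abstract does not state the formula, so the following is only a minimal sketch of the general idea (content similarity combined with similarity to shared neighbours), not the measure defined in the paper. The function names, the neighbour threshold, the use of the minimum as the per-neighbour support, and the equal 0.5 weighting are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighbour_aware_similarity(docs, threshold=0.1):
    """Hypothetical pairwise score: content similarity plus shared-neighbour support.

    `threshold` (an assumption) decides which documents count as neighbours.
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    # Plain content similarity from shared terms.
    content = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    # Neighbours of i: other documents whose content similarity exceeds the threshold.
    neigh = [{j for j in range(n) if j != i and content[i][j] > threshold}
             for i in range(n)]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
                continue
            shared = (neigh[i] & neigh[j]) - {i, j}
            # Support from a shared neighbour k: the weaker of its two links (assumed).
            support = sum(min(content[i][k], content[j][k]) for k in shared)
            support = support / len(shared) if shared else 0.0
            # Equal weighting of the two parts is an illustrative choice.
            sim[i][j] = 0.5 * (content[i][j] + support)
    return sim


docs = [
    "apple banana fruit",
    "banana fruit salad",
    "apple fruit juice",
    "engine car wheel",
    "car wheel road",
    "engine road car",
]
sim = neighbour_aware_similarity(docs)
```

A symmetric matrix such as `sim` could then be passed as a precomputed affinity to a spectral clustering implementation; how the paper itself builds and normalizes the affinity matrix is not described in the abstract.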