Article

Joint unsupervised contrastive learning and robust GMM for text clustering

Journal

Information Processing and Management

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2023.103529

Keywords

Text clustering; Contrastive learning; Negative sampling; Gaussian mixture model; Expectation maximization

Abstract

Text clustering aims to organize a vast collection of documents into meaningful and coherent clusters, thereby facilitating the extraction of valuable insights. While current text clustering frameworks try to minimize the anisotropy of pre-trained language models through contrastive learning of text embeddings, treating all in-batch samples as negatives is suboptimal. The K-means algorithm offers a way to sample both hard negatives and false negatives. However, relying solely on a single measure of semantic similarity between distributions and using coarse-grained weighting for negative pairs may limit performance. Furthermore, because text clusters with rich semantics often have very similar distributions, the Mahalanobis distance-based Gaussian Mixture Model (GMM) is prone to falling into local optima: a Gaussian component with a smaller weight may gradually merge into another during parameter estimation by the EM algorithm. To tackle these challenges, we propose a model named JourTC: Joint unsupervised contrastive learning and robust GMM for Text Clustering. In the contrastive learning phase, hard negatives, potential false negatives, and their corresponding global similarity-aware weights are determined through posterior probabilities derived from a Robust GMM (RGMM). The RGMM uses the entropy of each individual Gaussian component as a metric and adaptively adjusts the posterior probabilities of samples based on the components with maximum and minimum entropy, diminishing the influence of low-entropy components. Extensive experiments show that JourTC can be seamlessly integrated into existing text clustering frameworks, leading to a notable improvement in accuracy. Our code is publicly available.
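
The abstract does not spell out the exact RGMM update, but the entropy-based adjustment it describes can be illustrated with a small sketch. The code below is a minimal, hypothetical example rather than the authors' implementation: it computes standard GMM posterior probabilities (whose Gaussian densities are Mahalanobis-distance based), measures the entropy of each component's responsibilities over a batch, and rescales the posteriors so that low-entropy, near-degenerate components contribute less before renormalizing. The function names (gmm_posteriors, component_entropy, entropy_adjusted_posteriors) and the specific rescaling rule are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' implementation) of entropy-aware reweighting
# of GMM posteriors, in the spirit of the RGMM described in the abstract.
import numpy as np
from scipy.stats import multivariate_normal


def gmm_posteriors(X, means, covs, weights):
    """E-step: posterior probability of each Gaussian component for each sample."""
    K = len(weights)
    # Unnormalized responsibilities: pi_k * N(x | mu_k, Sigma_k)
    resp = np.stack(
        [weights[k] * multivariate_normal.pdf(X, means[k], covs[k]) for k in range(K)],
        axis=1,
    )
    return resp / resp.sum(axis=1, keepdims=True)


def component_entropy(post):
    """Entropy of each component's (column-normalized) responsibility profile."""
    p = post / post.sum(axis=0, keepdims=True)      # normalize per component
    return -(p * np.log(p + 1e-12)).sum(axis=0)     # one entropy value per component


def entropy_adjusted_posteriors(post, tau=1.0):
    """Illustrative adjustment (assumed rule): scale each component's responsibilities
    so that low-entropy components contribute less, then renormalize per sample."""
    H = component_entropy(post)
    scale = (H - H.min()) / (H.max() - H.min() + 1e-12)  # 0 for min-entropy, 1 for max-entropy
    adj = post * (tau + scale)                            # tau keeps every component alive
    return adj / adj.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy "clusters" of text embeddings with unbalanced sizes
    X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (10, 4))])
    means = [X[:50].mean(0), X[50:].mean(0)]
    covs = [np.cov(X[:50].T) + 1e-3 * np.eye(4), np.cov(X[50:].T) + 1e-3 * np.eye(4)]
    post = gmm_posteriors(X, means, covs, [0.8, 0.2])
    adj = entropy_adjusted_posteriors(post)
    print(adj.sum(axis=1)[:3])  # rows still sum to 1 after the adjustment
```

In JourTC, posteriors of this kind would then serve as the global similarity-aware weights for hard and potential false negatives in the contrastive loss; the precise weighting scheme is given in the full paper.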
