Article

Short text clustering based on Pitman-Yor process mixture model

Journal

APPLIED INTELLIGENCE
Volume 48, Issue 7, Pages 1802-1812

Publisher

SPRINGER
DOI: 10.1007/s10489-017-1055-4

Keywords

LDA; Pitman-Yor process; Short text clustering

Funding

  1. Natural Science Foundation of Jiangsu Province of China [BK20170513, BK20161338]
  2. National Natural Science Foundation of China [61703362, 61402203]
  3. Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China [17KJB520045]
  4. Science and Technology Planning Project of Yangzhou of China [YZ2016238]

To find the appropriate number of clusters in short text clustering, models based on the Dirichlet Multinomial Mixture (DMM) require a maximum possible number of clusters to be specified before the real number of clusters is inferred. However, it is difficult to choose a proper value, as the true number of clusters in short texts is not known beforehand. Moreover, with a Dirichlet process as the prior, the cluster distribution in DMM decays exponentially as the number of clusters increases. Therefore, in this paper we propose a novel model based on the Pitman-Yor process to capture the power-law phenomenon of the cluster distribution. Specifically, each text chooses one of the active clusters or a new cluster with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Discriminative and nondiscriminative words are identified automatically to help enhance text clustering. Parameters are estimated efficiently by collapsed Gibbs sampling, and experimental results show that PYPM is robust and effective compared with state-of-the-art models.
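
The choice between an active cluster and a new cluster can be illustrated with the standard Chinese-restaurant form of the Pitman-Yor process prior. The sketch below is not the paper's sampler: it covers only the prior term and omits the word likelihood and the handling of discriminative words that a full collapsed Gibbs step would need. The names `discount`, `concentration`, `pyp_prior_probs`, and `sample_partition` are assumptions made for illustration. Under this prior, a text joins an active cluster of size n_k with probability (n_k - discount) / (n + concentration) and opens a new cluster with probability (concentration + discount * K) / (n + concentration), where n is the number of texts already assigned and K is the number of active clusters.

```python
import random


def pyp_prior_probs(cluster_sizes, discount, concentration):
    """Pitman-Yor prior over the next assignment.

    Returns one probability per active cluster, followed by the
    probability of opening a new cluster (last entry).
    Parameter names are illustrative assumptions, not the paper's notation.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= 0.0:
        raise ValueError("this sketch assumes concentration > 0")
    n = sum(cluster_sizes)           # texts assigned so far
    k = len(cluster_sizes)           # active clusters
    denom = n + concentration
    existing = [(size - discount) / denom for size in cluster_sizes]
    new = (concentration + discount * k) / denom
    return existing + [new]


def sample_partition(num_texts, discount, concentration, seed=0):
    """Draw cluster sizes from the prior alone (no text likelihood)."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_texts):
        probs = pyp_prior_probs(sizes, discount, concentration)
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        if choice == len(sizes):
            sizes.append(1)          # open a new cluster
        else:
            sizes[choice] += 1       # join an active cluster
    return sizes


if __name__ == "__main__":
    # Two active clusters of sizes 3 and 1, then the new-cluster probability.
    print(pyp_prior_probs([3, 1], discount=0.5, concentration=1.0))
    # Cluster sizes drawn from the prior for 200 texts, largest first.
    print(sorted(sample_partition(200, 0.5, 1.0), reverse=True))
```

Setting `discount` to 0 recovers the Dirichlet process prior, in which the new-cluster probability no longer grows with the number of active clusters. A positive discount is what produces the heavier, power-law tail of cluster sizes that the abstract refers to.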
