Dirichlet Process Mixture Model for Document Clustering with Feature Partition

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2013)

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Volume 25, Issue 8, Pages 1748-1759

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TKDE.2012.27

Keywords

Database management; database applications-text mining; pattern recognition; clustering-document clustering; Dirichlet process mixture model; feature partition

Funding

National Natural Science Foundation of China [61202089, 11071128, 11131002]
Science and Technology Fund of Guizhou Province [2172]
Hong Kong Polytechnic University [A-PJ72]
Doctoral Fund of Ministry of Education of China [20110031110002]

Abstract

Determining the appropriate number of clusters into which documents should be partitioned is crucial in document clustering. In this paper, we propose a novel approach, DPMFP, that discovers the latent cluster structure based on the Dirichlet process mixture (DPM) model without requiring the number of clusters as input. Document features are automatically partitioned into two groups, discriminative words and nondiscriminative words, which contribute differently to document clustering. A variational inference algorithm is investigated to infer the document collection structure and the partition of document words simultaneously. Our experiments indicate that the proposed approach performs well on a synthetic data set as well as on real data sets. A comparison with state-of-the-art document clustering approaches shows that our approach is robust and effective for document clustering.
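
As a rough illustration of why a Dirichlet process mixture does not need the number of clusters as input, the sketch below draws mixture weights from a truncated stick-breaking construction of a Dirichlet process. It is not the DPMFP algorithm from the paper and does not model the word partition. The function names, the truncation level, the concentration value `alpha`, and the 0.01 threshold are all assumptions chosen for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha and truncation are illustrative parameters, not values from the paper.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component absorbs the leftover mass
    return weights


def effective_clusters(weights, threshold=0.01):
    """Count components whose weight exceeds an (assumed) threshold."""
    return sum(1 for w in weights if w > threshold)


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)
print(len(weights), effective_clusters(weights))
```

The truncation level only caps how many components are available. How many of them receive non-negligible weight is governed by `alpha` and, in a full model, by the data, which is the sense in which the cluster count is inferred and not fixed in advance.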
