Field independent probabilistic model for clustering multi-field documents

INFORMATION PROCESSING & MANAGEMENT (2009)

Journal

INFORMATION PROCESSING & MANAGEMENT

Volume 45, Issue 5, Pages 555-570

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.ipm.2009.03.005

Keywords

Document clustering; Finite mixture model; Multivariate Bernoulli model; Multinomial model; Field independent clustering model

Funding

Startup Fund of Fudan University
Shanghai Committee of Science and Technology, China [08DZ2271800, 09DZ2272800]
State Key Lab of Bio-Organic & Natural Products Chemistry, CAS

Abstract

We propose a new finite mixture model for clustering multiple-field documents, such as scientific literature with distinct fields: title, abstract, keywords, main text and references. This probabilistic model, which we call the field independent clustering model (FICM), incorporates the distinct word distribution of each field, both to integrate the discriminative ability of each field and to select the most suitable component probabilistic model for each field. We evaluated the performance of FICM by applying it to the problem of clustering three-field (title, abstract and MeSH) biomedical documents from the TREC 2004 and 2005 Genomics tracks, and two-field (title and abstract) news reports from Reuters-21578. Experimental results showed that FICM outperformed the classical multinomial model and the multivariate Bernoulli model, with statistically significant differences on all three collections. These results indicate that FICM outperforms widely used probabilistic models for document clustering by taking the characteristics of each field into account. We further showed that a component model consistent with the nature of its field achieved better performance, and that allowing different model settings across fields gave a further improvement. An extended abstract of parts of the work presented in this paper has appeared in Zhu et al. [Zhu, S., Takigawa, I., Zhang, S., & Mamitsuka, H. (2007). A probabilistic model for clustering text documents with multiple fields. In Proceedings of the 29th European conference on information retrieval, ECIR 2007. Lecture notes in computer science (Vol. 4425, pp. 331-342)]. © 2009 Elsevier Ltd. All rights reserved.
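
The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the fields of a document are conditionally independent given the cluster (as the model's name suggests), and that each field may use either a multinomial or a multivariate Bernoulli component. The function names, parameter layout, and field names are hypothetical, and parameter fitting (for example by EM) is not shown.

```python
import numpy as np

def field_log_lik(x, theta, kind):
    """Per-cluster log-likelihood of one field's word-count vector x.

    theta has shape (K, V): one row of word parameters per cluster.
    kind selects the component model assumed for this field.
    """
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if kind == "multinomial":
        # Uses word counts; the multinomial coefficient is dropped
        # because it is identical for every cluster.
        return x @ np.log(theta).T
    if kind == "bernoulli":
        # Uses only word presence or absence.
        b = (x > 0).astype(float)
        return b @ np.log(theta).T + (1.0 - b) @ np.log1p(-theta).T
    raise ValueError(f"unknown component model: {kind}")

def responsibilities(doc, params, kinds, log_prior):
    """Posterior cluster probabilities for one multi-field document.

    Assumption: fields are independent given the cluster, so the
    per-field log-likelihoods are summed.
    """
    log_post = np.array(log_prior, dtype=float)
    for field, x in doc.items():
        log_post += field_log_lik(x, params[field], kinds[field])
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```

Under these assumptions, a caller would pass one count vector per field (for instance "title" and "abstract"), one (K, V) parameter matrix per field, and a per-field choice of component model, which mirrors the abstract's point that different fields may suit different models.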
