4.5 Article

Combining semantic and term frequency similarities for text clustering

Journal

KNOWLEDGE AND INFORMATION SYSTEMS
Volume 61, Issue 3, Pages 1485-1516

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s10115-018-1278-7

Keywords

Document clustering; Similarity measure; Semantic similarity; Text mining

Funding

  1. Brazilian Research Agency Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior (CAPES) [001]
  2. Brazilian Research Agency CNPq
  3. Brazilian Research Agency FAPEMIG
  4. Brazilian Research Agency FAPESP
  5. Natural Sciences and Engineering Research Council of Canada
  6. Boeing Company
  7. International Development Research Centre, Ottawa, Canada
  8. CALDO

Abstract

A key challenge in document clustering is finding a similarity measure for text documents that yields cohesive groups. Measures based on the classic bag-of-words model consider only the presence (and frequency) of words in documents. As a result, semantically similar documents that use different vocabularies may end up in different clusters. For this reason, semantic similarity measures that use external knowledge, such as word n-gram corpora or thesauri, have been proposed in the literature. In this paper, the Frequency Google Tri-gram Measure is proposed to assess similarity between documents based on the frequencies of terms in the compared documents, with the Google n-gram corpus as an additional source of semantic similarity. Clustering algorithms are applied to several real datasets to experimentally evaluate the quality of the clusters obtained with the proposed measure and to compare it with a number of state-of-the-art measures from the literature. The experimental results, supported by statistical tests, show that the proposed measure significantly improves the quality of document clustering. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.
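
The general idea of combining a term-frequency similarity with a semantic one can be illustrated with a minimal sketch. This is not the Frequency Google Tri-gram Measure itself, whose formula is not given in the abstract. The names `TOY_RELATEDNESS`, `relatedness`, `tf_cosine`, `semantic_similarity`, `combined_similarity`, and the weight `alpha` are all assumptions made for illustration, and the relatedness scores are invented stand-ins for statistics that would come from an n-gram corpus.

```python
import math
from collections import Counter

# Hypothetical word-relatedness scores in [0, 1]. These values are made up;
# a real system would derive them from an external source such as an
# n-gram corpus.
TOY_RELATEDNESS = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
}


def relatedness(w1, w2):
    """Return 1.0 for identical words, a toy score for known pairs, else 0.0."""
    if w1 == w2:
        return 1.0
    return TOY_RELATEDNESS.get(frozenset((w1, w2)), 0.0)


def tf_cosine(doc_a, doc_b):
    """Cosine similarity between raw term-frequency vectors (bag of words)."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def semantic_similarity(doc_a, doc_b):
    """Average best-match word relatedness, symmetrised over both directions."""
    if not doc_a or not doc_b:
        return 0.0

    def directed(src, dst):
        return sum(max(relatedness(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(doc_a, doc_b) + directed(doc_b, doc_a))


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Weighted mix of the two similarities; alpha is an assumed parameter."""
    return alpha * tf_cosine(doc_a, doc_b) + (1 - alpha) * semantic_similarity(doc_a, doc_b)


doc1 = "fast car".split()
doc2 = "quick automobile".split()

print("term-frequency only:", tf_cosine(doc1, doc2))
print("semantic only:      ", semantic_similarity(doc1, doc2))
print("combined:           ", combined_similarity(doc1, doc2))
```

In this toy case the two documents share no words, so the bag-of-words score is zero, while the semantic and combined scores are positive. That is the situation the abstract describes, where documents with different vocabularies would otherwise land in different clusters.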
