3.8 Proceedings Paper

A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering

Publisher

IEEE
DOI: 10.1109/BigDataService.2016.14

Keywords

Document Clustering; Similarity Level; Document Similarity; Geometric Similarity; VSM

The increasing number of textual documents from diverse sources, such as websites (e.g. social networks, news, magazines, blogs and medical recommendation websites), publications, articles and medical prescriptions, produces massive amounts of complex data every day. This has led many researchers to focus on analysing content and measuring the similarity among documents and texts in order to cluster them. One popular way to measure the similarity between documents is to represent them as vectors and compare each pair by the angle or the Euclidean distance between them. Considering only these two criteria may miss important underlying similarities. We propose a new method, TS-SS, to measure the similarity level among documents, so that it becomes clearer which documents are more (or less) similar. This similarity level can serve as a practical measure for document clustering and recommendation systems. It can also be used to show the top n documents most similar to a particular document or search query. Our study gives insights into the drawbacks of geometrical and non-geometrical similarity measures and provides a novel method that combines other geometric criteria to measure the similarity level among documents from a new perspective. We apply Euclidean distance, Cosine similarity and our new method to four labelled datasets. Finally, we report how these three geometrical similarity measures perform in terms of similarity level and clustering purity using four evaluation techniques. The evaluation results show that our new model outperforms the other measures.
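
The two baseline criteria named in the abstract, the angle and the Euclidean distance between document vectors, can be sketched as follows. This is a minimal illustration only and is not the TS-SS method, whose formula is not given on this page. The toy documents, the raw term-count weighting and the helper names (`term_vector`, `cosine_similarity`, `euclidean_distance`) are assumptions made for the example.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Raw term-count vector over a fixed vocabulary (assumed weighting)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy corpus, chosen so that the two criteria disagree.
docs = {
    "short": "data mining",
    "long": "data mining data mining data mining",
    "other": "data clustering",
}
vocabulary = sorted({word for text in docs.values() for word in text.lower().split()})
vectors = {name: term_vector(text, vocabulary) for name, text in docs.items()}

for name in ("long", "other"):
    cos = cosine_similarity(vectors["short"], vectors[name])
    dist = euclidean_distance(vectors["short"], vectors[name])
    print(f"short vs {name}: cosine={cos:.3f}, euclidean={dist:.3f}")
```

In this toy case the angle ranks "long" as the closest match to "short" (same direction, different magnitude), while the Euclidean distance ranks "other" as closer. Disagreements of this kind are what motivate combining several geometric criteria rather than relying on one.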
