Article

Using chi-square statistics to measure similarities for text categorization

EXPERT SYSTEMS WITH APPLICATIONS (2011)

Journal

EXPERT SYSTEMS WITH APPLICATIONS

Volume 38, Issue 4, Pages 3085-3090

Publisher

PERGAMON-ELSEVIER SCIENCE LTD

DOI: 10.1016/j.eswa.2010.08.100

Keywords

Nonparametric statistics; Text mining; Machine learning

Categories

Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic; Operations Research & Management Science

Funding

National Science Council of Taiwan [97-2221-E-001-014-MY3, NSC95-2416-H-346-002]

Abstract

In this paper, we propose using chi-square statistics to measure similarities and chi-square tests to determine the homogeneity of two random samples of term vectors for text categorization. The properties of chi-square tests for text categorization are studied first. One advantage of the chi-square test is that its significance level is similar to the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. Generally, a classifier using cosine similarities with TF*IDF performs reasonably well in text categorization. However, its performance may fluctuate even near the optimal threshold value. To address this limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage. (C) 2010 Elsevier Ltd. All rights reserved.
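To make the two measures named in the abstract concrete, here is a minimal sketch of a chi-square homogeneity statistic and a cosine similarity computed on two term vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of raw term counts, and the handling of terms absent from both samples are all choices made for this example. The abstract does not say how the two measures are combined, so the sketch only computes each one separately.

```python
import math


def chi_square_homogeneity(counts_a, counts_b):
    """Return (statistic, degrees_of_freedom) for a 2 x k homogeneity test.

    counts_a and counts_b are assumed to be raw term counts over the same
    vocabulary (hypothetical input format, not taken from the paper).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("term vectors must have the same length")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one term occurrence")
    grand_total = total_a + total_b

    statistic = 0.0
    terms_used = 0
    for a, b in zip(counts_a, counts_b):
        column_total = a + b
        if column_total == 0:
            continue  # term absent from both samples: contributes nothing
        terms_used += 1
        # Expected counts if both samples came from the same distribution.
        expected_a = column_total * total_a / grand_total
        expected_b = column_total * total_b / grand_total
        statistic += (a - expected_a) ** 2 / expected_a
        statistic += (b - expected_b) ** 2 / expected_b
    return statistic, terms_used - 1


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (e.g. TF*IDF weights)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    doc = [12, 0, 7, 3]        # toy term counts, invented for illustration
    category = [40, 5, 22, 9]
    stat, dof = chi_square_homogeneity(doc, category)
    print("chi-square:", stat, "dof:", dof)
    print("cosine:", cosine_similarity(doc, category))
```

A smaller statistic indicates that the two samples are more alike. To turn it into a test, compare it with the chi-square critical value for the returned degrees of freedom at the chosen significance level; homogeneity is rejected when the statistic exceeds that value.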
