Journal
EXPERT SYSTEMS WITH APPLICATIONS
Volume 38, Issue 4, Pages 3085-3090
Publisher
PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2010.08.100
Keywords
Nonparametric statistics; Text mining; Machine learning
Funding
- National Science Council of Taiwan [97-2221-E-001-014-MY3, NSC95-2416-H-346-002]
In this paper, we propose using chi-square statistics to measure similarities, and chi-square tests to determine the homogeneity of two random samples of term vectors, for text categorization. The properties of chi-square tests for text categorization are studied first. One advantage of the chi-square test is that its significance level is similar to the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. Generally, a classifier using cosine similarities with TF*IDF performs reasonably well in text categorization. However, its performance may fluctuate even near the optimal threshold value. To address this limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage. (C) 2010 Elsevier Ltd. All rights reserved.
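To make the two measures in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes the standard Pearson chi-square statistic for the homogeneity of two term-count vectors alongside a plain cosine similarity. The function names, the use of raw counts rather than TF*IDF weights, and the simple "both conditions must hold" combination rule are all assumptions made for illustration; the abstract does not specify how the paper combines the two measures.

```python
import math


def chi_square_statistic(a, b):
    """Pearson chi-square statistic for homogeneity of two term-count vectors.

    a and b are equal-length lists of non-negative term counts.
    """
    n_a, n_b = sum(a), sum(b)
    if n_a == 0 or n_b == 0:
        raise ValueError("each sample must contain at least one term")
    total = n_a + n_b
    stat = 0.0
    for x, y in zip(a, b):
        col = x + y
        if col == 0:
            continue  # a term absent from both samples contributes nothing
        # Expected counts if both samples came from the same distribution.
        exp_a = n_a * col / total
        exp_b = n_b * col / total
        stat += (x - exp_a) ** 2 / exp_a + (y - exp_b) ** 2 / exp_b
    return stat


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_category(a, b, cos_threshold, chi2_critical):
    """Hypothetical combination rule (an assumption, not taken from the paper):
    accept only if the cosine similarity is high enough AND the chi-square
    test does not reject homogeneity at the chosen critical value."""
    return (cosine_similarity(a, b) >= cos_threshold
            and chi_square_statistic(a, b) <= chi2_critical)
```

The caller supplies `chi2_critical` from a chi-square table for the chosen significance level and degrees of freedom (number of terms observed in either sample, minus one); for example, 3.841 corresponds to a 0.05 level with one degree of freedom.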