Article

t-Test feature selection approach based on term frequency for text categorization

Journal

PATTERN RECOGNITION LETTERS
Volume 45, Pages 1-10

Publisher

ELSEVIER
DOI: 10.1016/j.patrec.2014.02.013

Keywords

Feature selection; Term frequency; Student t-test; Text classification

Funding

  1. State Key Laboratory of Software Development Environment [SKLSDE-2013ZX-36]


Feature selection techniques play an important role in text categorization (TC), especially for large-scale TC tasks. Many new and improved methods have been proposed, and most of them are based on document frequency, such as the well-known chi-square statistic (χ²) and information gain (IG). These document-frequency-based methods, however, have two shortcomings: (1) they are not reliable for low-frequency terms, that is, low-frequency terms tend to be filtered out because of their smaller weights; and (2) they only record whether a term occurs within a document and ignore term frequency. In practice, a high-frequency term (other than a stop word) that occurs in few documents is often regarded as a discriminator in real-life corpora. To address these drawbacks, the paper focuses on how to construct a feature selection function based on term frequency, and proposes a new approach using Student's t-test. The t-test function is used to measure the diversity of the distributions of a term's frequency between a specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the proposed approach is comparable to state-of-the-art feature selection methods in terms of macro-F1 and micro-F1. In particular, on micro-F1 the method achieves slightly better performance on Reuters with kNN and SVM classifiers, compared to χ² and IG. © 2014 Elsevier B.V. All rights reserved.
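The abstract does not give the scoring formula, so the following is only an illustrative sketch of the general idea: score a term by how far its per-document frequency in one category departs from its frequency over the whole corpus. It assumes Welch's two-sample t statistic and length-normalized term frequencies, and it treats the category and corpus samples as if they were independent, which is a simplification. The names term_frequencies, welch_t, and t_score, and the toy corpus, are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def term_frequencies(docs, term):
    """Length-normalized frequency of `term` in each tokenized document."""
    return [Counter(doc)[term] / len(doc) if doc else 0.0 for doc in docs]


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (assumes each sample has >= 2 values)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    denom = math.sqrt(var_a / n_a + var_b / n_b)
    return 0.0 if denom == 0 else (mean_a - mean_b) / denom


def t_score(corpus, labels, term, category):
    """Absolute t statistic of a term: category documents vs. entire corpus."""
    in_category = [doc for doc, lab in zip(corpus, labels) if lab == category]
    return abs(welch_t(term_frequencies(in_category, term),
                       term_frequencies(corpus, term)))


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ["wheat", "wheat", "price"],
    ["wheat", "farm", "wheat", "wheat"],
    ["bank", "rate", "price"],
    ["bank", "loan", "rate", "bank"],
]
labels = ["grain", "grain", "money", "money"]

wheat_score = t_score(corpus, labels, "wheat", "grain")
price_score = t_score(corpus, labels, "price", "grain")
```

In this toy corpus, "wheat" is frequent inside the "grain" documents and absent elsewhere, so it scores higher than "price", whose average frequency in the category matches its average over the corpus. A selector built on this sketch would rank terms by such scores and keep the top-ranked ones; how the paper combines scores across categories is not stated in the abstract.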
