4.5 Article

An optimal approach for text feature selection

Journal

COMPUTER SPEECH AND LANGUAGE
Volume 74, Issue -, Pages -

Publisher

ACADEMIC PRESS LTD- ELSEVIER SCIENCE LTD
DOI: 10.1016/j.csl.2022.101364

Keywords

Feature selection; Text categorization; Text mining; Data mining; Arabic text mining

This paper proposes a new feature selection method, MFX, which selects a subset of features by formulating the selection mathematically as an optimization problem. MFX considers both classification accuracy and feature discriminability, and has two distinguishing aspects: it treats all documents from the same category as one extended document, and it chooses discriminative terms that are frequent within the category and rare in other categories. Experimental results on several datasets show that MFX performs similarly to or better than other feature selection methods, and that MFX combined with a Support Vector Machine (SVM) classifier outperforms recent text classification algorithms based on neural networks and word embeddings.
Traditionally, feature selection is conducted by first deriving a candidate list of features, then ranking them and selecting the top features based on a predefined threshold. These methods are highly dependent on the choice of the threshold, and therefore lead to sub-optimal text categorization results. In this paper, we address the selection problem by suggesting a one-step method designed to optimally select the subset of features. The selection is formulated mathematically as an optimization problem with the objective of maximizing classification accuracy while simultaneously deriving and choosing the most discriminative features. Our method, MFX, can be applied alongside many conventional methods, and has two distinguishing aspects. First, it considers all documents from the same category as one extended document, instead of analyzing individual documents. Second, it chooses the most discriminative terms: those that are frequent and common across all documents of the same category, and minimally present in other categories. Moreover, MFX is language-independent. It was tested on the well-known benchmark Reuters RCV1 dataset. To showcase its language independence, MFX was also tested on Arabic datasets extracted from Arabic news sources. The results indicated that MFX always performed similarly to or better than other well-known feature selection methods. MFX with a Support Vector Machine (SVM) classifier was also shown to outperform recent text classification algorithms based on neural networks and word embeddings.
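
The two distinguishing aspects described in the abstract can be illustrated with a small sketch. This is not the MFX optimization formulation, which the abstract does not spell out. It is a toy ranking under stated assumptions: the function name `discriminative_terms`, the `top_k` parameter, the whitespace tokenization, and the scoring rule (in-category relative frequency times document coverage, minus the highest relative frequency in any other category) are all hypothetical and chosen only to show the idea of pooling each category into one extended document and preferring terms that are frequent inside it and rare outside it.

```python
from collections import Counter


def discriminative_terms(docs_by_category, top_k=2):
    """Rank terms per category with a toy frequent-inside / rare-outside score.

    Hypothetical scoring for illustration only; not the MFX objective.
    """
    # Treat each category as one extended document by pooling its tokens.
    pooled = {
        cat: Counter(tok for doc in docs for tok in doc.split())
        for cat, docs in docs_by_category.items()
    }
    totals = {cat: sum(counts.values()) for cat, counts in pooled.items()}

    selected = {}
    for cat, docs in docs_by_category.items():
        scores = {}
        for term, count in pooled[cat].items():
            # Relative frequency of the term in the extended document.
            inside = count / totals[cat]
            # Share of the category's documents that contain the term.
            coverage = sum(term in doc.split() for doc in docs) / len(docs)
            # Highest relative frequency of the term in any other category.
            outside = max(
                (pooled[other][term] / totals[other]
                 for other in pooled if other != cat),
                default=0.0,
            )
            scores[term] = inside * coverage - outside
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected[cat] = ranked[:top_k]
    return selected


# Made-up sample data, assumed purely for demonstration.
sample = {
    "sports": ["match goal team", "team goal win", "team market"],
    "finance": ["market stock price", "stock market bank", "bank team"],
}
print(discriminative_terms(sample))
```

Note that this sketch still ranks terms and cuts at `top_k`, which is exactly the threshold dependence the paper aims to avoid; according to the abstract, MFX instead selects the subset in one step by solving an optimization problem.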
