Article

An optimal approach for text feature selection

COMPUTER SPEECH AND LANGUAGE (2022)

Journal

COMPUTER SPEECH AND LANGUAGE

Volume 74

Publisher

ACADEMIC PRESS LTD- ELSEVIER SCIENCE LTD

DOI: 10.1016/j.csl.2022.101364

Keywords

Feature selection; Text categorization; Text mining; Data mining; Arabic text mining

Automated Summary

This paper proposes a new feature selection method called MFX, which optimally selects a subset of features by formulating the selection problem mathematically as an optimization problem. MFX considers both classification accuracy and feature discriminability, and has two distinguishing aspects: it treats all documents from the same category as one extended document, and it chooses discriminative terms that are frequent within the category and rare in other categories. Experimental results on several datasets show that MFX performs similarly to or better than other feature selection methods, and that, when combined with a Support Vector Machine (SVM) classifier, it outperforms recent text classification algorithms based on neural networks and word embeddings.

Abstract

Traditionally, feature selection is conducted by first deriving a candidate list of features, then ranking them and selecting the top features based on a predefined threshold. These methods are highly dependent on the choice of the threshold, and therefore lead to sub-optimal text categorization results. In this paper, we address the selection problem by suggesting a one-step method designed to optimally select the subset of features. The selection is formulated mathematically as an optimization problem with the objective of maximizing classification accuracy while simultaneously deriving and choosing the most discriminative features. Our method, MFX, is applicable to many of the conventional methods, with two distinguishing aspects. First, it considers all documents from the same category as one extended document, instead of analyzing individual documents. Second, it chooses the most discriminative terms: those that are frequent and common across all documents of the same category, and minimally present in other categories. Moreover, MFX is language-independent. It was tested on the well-known benchmark Reuters RCV1 dataset. To showcase its language independence, MFX was also tested on Arabic datasets extracted from Arabic news sources. The results indicated that MFX always performed similarly to or better than other well-known feature selection methods. MFX with a Support Vector Machine (SVM) classifier was also shown to outperform recent text classification algorithms based on neural networks and word embeddings.
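The two aspects described in the abstract (one extended document per category, and terms that are frequent in a category but rare elsewhere) can be illustrated with a small sketch. This is not the MFX optimization formulation, which the abstract does not spell out; it is a simple heuristic under stated assumptions. The function names `extended_documents` and `discriminative_terms`, the `top_k` parameter, the toy corpus, and the score (a term's share of its own category's tokens minus its largest share in any other category) are all hypothetical choices made for illustration.

```python
from collections import Counter


def extended_documents(docs):
    """Merge all tokenized documents of each category into one term-count bag."""
    merged = {}
    for category, tokens in docs:
        merged.setdefault(category, Counter()).update(tokens)
    return merged


def discriminative_terms(docs, top_k=2):
    """Rank terms per category by in-category share minus the largest share elsewhere.

    Illustrative heuristic only; it is NOT the paper's optimization problem.
    """
    merged = extended_documents(docs)
    # Relative frequency of each term inside each extended document.
    share = {
        cat: {term: n / sum(counts.values()) for term, n in counts.items()}
        for cat, counts in merged.items()
    }
    selected = {}
    for cat, terms in share.items():
        def score(term):
            # Largest share of this term in any other category (0 if absent).
            elsewhere = max(
                (share[other].get(term, 0.0) for other in share if other != cat),
                default=0.0,
            )
            return terms[term] - elsewhere

        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(terms, key=lambda t: (-score(t), t))
        selected[cat] = ranked[:top_k]
    return selected


# Toy corpus (hypothetical): (category, tokens) pairs.
docs = [
    ("sports", ["match", "goal", "team", "goal"]),
    ("sports", ["team", "goal", "match"]),
    ("finance", ["market", "stock", "team", "stock"]),
    ("finance", ["stock", "market", "win"]),
]
print(discriminative_terms(docs))
```

In this toy corpus, "team" is frequent in the sports category but also appears in the finance category, so its score is reduced and it is not selected for either. A fixed `top_k` is exactly the kind of predefined threshold the abstract criticizes; the paper's contribution is to replace that step with an optimization, which this sketch does not attempt.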
