Article

Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification

PEERJ COMPUTER SCIENCE (2022)

Journal

PEERJ COMPUTER SCIENCE

Volume 8, Issue -, Pages -

Publisher

PEERJ INC

DOI: 10.7717/peerj-cs.961

Keywords

Chi-square; Document frequency; Extra tree classifier; Feature selection; Genetic algorithm; Information gain; Text classification

Categories

Computer Science, Artificial Intelligence; Computer Science, Information Systems; Computer Science, Theory & Methods

Summary

Text classification categorizes documents based on their content into predefined categories. Selecting appropriate features is crucial when dealing with a large number of features. This paper presents a hybrid feature selection method combining document frequency and a genetic algorithm for Amharic text classification, which outperforms other methods and improves classification accuracy when combined with the Extra Tree Classifier.

Abstract

Text classification is the process of categorizing documents based on their content into a predefined set of categories. Text classification algorithms typically represent documents as collections of words and therefore deal with a large number of features. The selection of appropriate features becomes important when the initial feature set is quite large. In this paper, we present a hybrid feature selection method based on document frequency (DF) and a genetic algorithm (GA) for Amharic text classification. We evaluate this feature selection method on Amharic news documents obtained from the Ethiopian News Agency (ENA). The number of categories used in this study is 13. Our experimental results showed that the proposed feature selection method outperformed other feature selection methods utilized for Amharic news document classification. Combining the proposed feature selection method with the Extra Tree Classifier (ETC) improves classification accuracy: the accuracy is up to 1% higher than that of the hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), 2.47% higher than that of GA, and 3.86% higher than that of a hybrid of DF, IG, and CHI.
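The two-stage idea described in the abstract (a DF filter followed by a GA search over the surviving terms) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the GA parameters, and the toy English tokens are all hypothetical, and the fitness function is a simple leave-one-out nearest-neighbour accuracy standing in for the Extra Tree Classifier accuracy that the paper reports.

```python
import random
from collections import Counter

def document_frequency_filter(docs, min_df=2):
    """Stage 1: keep terms that occur in at least `min_df` documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in df.items() if n >= min_df)

def accuracy(mask, vocab, docs, labels):
    """Fitness stand-in (assumption): leave-one-out accuracy of a
    term-overlap nearest-neighbour classifier on the selected terms."""
    selected = {t for t, bit in zip(vocab, mask) if bit}
    if not selected:
        return 0.0
    reduced = [set(d) & selected for d in docs]
    correct = 0
    for i, doc in enumerate(reduced):
        # Nearest neighbour = other document sharing the most selected terms.
        best = max((j for j in range(len(docs)) if j != i),
                   key=lambda j: len(doc & reduced[j]))
        correct += labels[best] == labels[i]
    return correct / len(docs)

def genetic_select(vocab, docs, labels, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Stage 2: GA over binary masks (needs at least two terms in vocab)."""
    rng = random.Random(seed)
    fitness = lambda m: accuracy(m, vocab, docs, labels)
    population = [[rng.randint(0, 1) for _ in vocab] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(vocab))     # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability `mutation_rate` per gene.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return [t for t, bit in zip(vocab, best) if bit]

# Toy corpus with placeholder tokens (not ENA data).
docs = [
    ["goal", "match", "team", "the"],
    ["team", "coach", "match", "the"],
    ["goal", "coach", "league", "the"],
    ["bank", "market", "price", "the"],
    ["market", "trade", "bank", "the"],
    ["price", "trade", "export", "the"],
]
labels = ["sport"] * 3 + ["economy"] * 3

vocab = document_frequency_filter(docs, min_df=2)
selected = genetic_select(vocab, docs, labels)
print(len(vocab), "terms after DF filter;", len(selected), "after GA")
```

In this sketch the DF stage cheaply discards rare terms so that the GA, whose fitness evaluation is the expensive part, searches a smaller space. Swapping `accuracy` for the cross-validated score of a real classifier would bring the sketch closer to the setup the abstract describes.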
