Feature selection using cloud-based parallel genetic algorithm for intrusion detection data classification

NEURAL COMPUTING & APPLICATIONS (2021)

Journal

NEURAL COMPUTING & APPLICATIONS

Volume 33, Issue 18, Pages 11861-11873

Publisher

SPRINGER LONDON LTD

DOI: 10.1007/s00521-021-05871-5

Keywords

Parallel genetic algorithm; Machine learning; Feature selection; Intrusion detection systems

Automated Summary

This study migrates genetic algorithm-based feature selection to a MapReduce implementation that can be parallelized across a large number of commodity hardware units. Parallelizing the genetic algorithm reduces overall data preprocessing time and permits a larger population, which in turn leads to better feature selection. The authors report that the implementation outperforms existing feature selection methods.

Abstract

With the exponential growth in the amount of data generated, stored and processed daily in machine learning, data analytics and decision-making systems, data preprocessing has established itself as the key factor in building reliable, high-performance machine learning models. One role of preprocessing is variable reduction using feature selection methods; however, the processing time these methods require is a major drawback. This study aims to mitigate this problem by migrating the algorithm to a MapReduce implementation suitable for parallelization on a large number of commodity hardware units. The study focuses on genetic algorithm-based methods. Hadoop, an open-source MapReduce library, was used as the framework for implementing the parallel genetic algorithms. Representative machine learning methods, namely SVM (support vector machine), ANN (artificial neural network), RT (random tree), logistic regression and Naive Bayes, were embedded into the implementation for feature selection. The feature selection methods were applied to four NSL-KDD data sets, reducing the number of features from approximately 40 to approximately 10 with an accuracy of 90.45%. These results have both practical and theoretical impact. On the theoretical side, the genetic algorithm has been parallelized in the MapReduce manner, which had been considered unachievable in a strict sense. Furthermore, the genetic algorithm allows randomness-enhanced feature selection, and its parallelization reduces overall data preprocessing time and allows a larger population count, which in turn leads to better feature selection. On the practical side, it has been shown that this implementation outperforms the existing feature selection methods.
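The map/reduce split described in the abstract can be illustrated with a small single-process sketch. This is not the authors' Hadoop implementation: the synthetic data, the nearest-centroid fitness function, the feature-count penalty, and every function name and parameter below are assumptions chosen only to keep the example self-contained. The "map" phase scores each chromosome (a feature bitmask) independently, which is the part a cluster could distribute; the "reduce" phase gathers the scores and builds the next generation.

```python
import random

def make_data(rng, n_rows=200, n_features=12, informative=(0, 3, 7)):
    """Synthetic binary data (hypothetical): only `informative` columns carry the label."""
    rows, labels = [], []
    for i in range(n_rows):
        y = i % 2  # alternate labels so both classes are always present
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            x[j] += 2.0 * y  # shift informative features for class 1
        rows.append(x)
        labels.append(y)
    return rows, labels

def fitness(mask, rows, labels):
    """Map step: score one feature bitmask independently of all others.

    Toy score: nearest-centroid accuracy on the same data, minus a small
    penalty per selected feature (an assumption, not the paper's fitness).
    """
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    centroids = {}
    for c in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(r[j] for r in members) / len(members) for j in cols]
    correct = 0
    for r, y in zip(rows, labels):
        dist = {c: sum((r[j] - m) ** 2 for j, m in zip(cols, centroids[c]))
                for c in (0, 1)}
        correct += int(min(dist, key=dist.get) == y)
    return correct / len(rows) - 0.01 * len(cols)

def next_generation(scored, rng, mutation_rate=0.05):
    """Reduce step: gather (score, mask) pairs, then select, cross over, mutate."""
    scored = sorted(scored, key=lambda p: p[0], reverse=True)
    elite = [mask for _, mask in scored[: max(2, len(scored) // 2)]]
    children = [elite[0][:]]  # elitism: carry the best mask over unchanged
    while len(children) < len(scored):
        a, b = rng.sample(elite, 2)
        cut = rng.randrange(1, len(a))  # single-point crossover
        child = a[:cut] + b[cut:]
        child = [bit ^ (rng.random() < mutation_rate) for bit in child]
        children.append(child)
    return children

def run_ga(rows, labels, pop_size=20, generations=15, seed=0):
    rng = random.Random(seed)
    n = len(rows[0])
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = (float("-inf"), None)
    for _ in range(generations):
        # Map: every fitness call is independent, so it could run in parallel.
        scored = [(fitness(m, rows, labels), m) for m in population]
        best = max(best, max(scored, key=lambda p: p[0]), key=lambda p: p[0])
        # Reduce: combine all scores into the next population.
        population = next_generation(scored, rng)
    return best

rows, labels = make_data(random.Random(1))
score, mask = run_ga(rows, labels)
print(score, [j for j, bit in enumerate(mask) if bit])
```

Because only the fitness evaluation is independent per chromosome, the sketch keeps selection and crossover in a single gathering step. In a real MapReduce job each generation would typically be one job, and the classifier used for scoring would be one of the methods named in the abstract rather than this toy centroid rule.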
