Article

Feature selection using cloud-based parallel genetic algorithm for intrusion detection data classification

Journal

NEURAL COMPUTING & APPLICATIONS
Volume 33, Issue 18, Pages 11861-11873

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s00521-021-05871-5

Keywords

Parallel genetic algorithm; Machine learning; Feature selection; Intrusion detection systems

This study migrates genetic algorithm-based feature selection to a MapReduce implementation that can be parallelized across a large number of commodity hardware units. Parallelizing the genetic algorithm reduces overall data preprocessing time and permits a larger population, which in turn leads to better feature selection; the authors report that the implementation outperforms existing feature selection methods.

With the exponential growth in the amount of data generated, stored and processed daily in machine learning, data analytics and decision-making systems, data preprocessing has established itself as the key factor in building reliable, high-performance machine learning models. One role of preprocessing is variable reduction using feature selection methods; however, the processing time these methods require is a major drawback. This study aims to mitigate this problem by migrating the algorithm to a MapReduce implementation suitable for parallelization on a large number of commodity hardware units. The study focuses on genetic algorithm-based methods. Hadoop, an open-source MapReduce library, was used as the framework for implementing the parallel genetic algorithms. Representative machine learning methods, namely SVM (support vector machine), ANN (artificial neural network), RT (random tree), logistic regression and Naive Bayes, were embedded into the implementation for feature selection. The feature selection methods were applied to four NSL-KDD data sets, and the number of features was reduced from approximately 40 to approximately 10 with an accuracy of 90.45%. These results have both theoretical and practical impact. On the theoretical side, the genetic algorithm has been parallelized in the MapReduce manner, which had been considered unachievable in a strict sense. Furthermore, the genetic algorithm allows randomness-enhanced feature selection, and its parallelization reduces overall data preprocessing time and allows a larger population count, which in turn leads to better feature selection. On the practical side, it has been shown that this implementation outperforms the existing feature selection methods.
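
To make the idea in the abstract concrete, the following is a minimal single-process sketch of genetic-algorithm feature selection organized as a map step (score each candidate feature subset independently) and a reduce step (collect the scores and keep the fittest). It is not the authors' Hadoop implementation. The fitness function is a toy stand-in for the classifier accuracy the paper uses, and the names `INFORMATIVE`, `map_phase`, `reduce_phase` and `select_features`, as well as all parameter values, are hypothetical choices made for illustration.

```python
import random

N_FEATURES = 40  # roughly the NSL-KDD feature count mentioned in the abstract
# Assumption: a made-up set of "useful" feature indices, for illustration only.
INFORMATIVE = frozenset({0, 3, 5, 8, 13, 21, 22, 28, 31, 37})


def fitness(mask):
    """Toy stand-in for classifier accuracy: reward useful features, penalize subset size."""
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


def map_phase(population):
    """Map: score every chromosome independently; this is the part that can run in parallel."""
    return [(fitness(mask), mask) for mask in population]


def reduce_phase(scored, keep):
    """Reduce: gather the scored chromosomes and keep the `keep` fittest ones."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [mask for _, mask in ranked[:keep]]


def crossover(a, b, rng):
    """Single-point crossover of two bit-mask chromosomes."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(mask, rng, rate=0.02):
    """Flip each bit independently with probability `rate`."""
    return tuple(bit ^ (rng.random() < rate) for bit in mask)


def select_features(generations=60, pop_size=40, seed=0):
    """Run the toy genetic algorithm and return the selected feature indices."""
    rng = random.Random(seed)
    population = [
        tuple(rng.randint(0, 1) for _ in range(N_FEATURES)) for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Elitism: the fitter half survives unchanged and breeds the other half.
        parents = reduce_phase(map_phase(population), keep=pop_size // 2)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    best = reduce_phase(map_phase(population), keep=1)[0]
    return [i for i, bit in enumerate(best) if bit]


if __name__ == "__main__":
    print(select_features())
```

In a distributed setting such as the Hadoop deployment the abstract describes, the work done by `map_phase` would be spread across worker nodes, since each chromosome can be scored without reference to the others; a larger population then costs more mappers rather than more wall-clock time.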
