Article

Optimal feature configuration for dynamic malware detection

Journal

COMPUTERS & SECURITY
Volume 105

Publisher

ELSEVIER ADVANCED TECHNOLOGY
DOI: 10.1016/j.cose.2021.102250

Keywords

Machine learning; Malware detection; Feature engineering; Performance evaluation; Statistical inference; Dynamic analysis

Funding

  1. Spanish National Cybersecurity Institute (INCIBE)

This research configures and compares feature sets for dynamic malware detection, built from API calls and other sandbox observations, and trains machine learning models on them. By testing combinations of feature sets, evaluating the models on unbalanced datasets, and comparing the results with statistical inference, the authors identify an optimal feature set whose model reaches an accuracy of 0.9937 on the balanced dataset and a Matthews correlation coefficient of 0.964 on the unbalanced dataset with 5% of malware.

Applying machine learning techniques to malware detection is a common approach to overcoming the limitations of signature-based methods. However, it is difficult to engineer a set of features that characterizes the samples properly, especially when various file types may be a vector of infection. In this work, we configure several feature sets for dynamic malware detection extracted from API calls, including an alternative scheme grouping calls in categories, network activity, signatures from the Cuckoo sandbox report, and some interactions with the file system and registry. We test combinations of these feature sets to ascertain whether they are good enough to distinguish between benign and malicious samples from a dataset containing several file types, obtained from public sources. We apply statistical inference to measure the differences in performance between the feature sets and between the hyperparameter optimization algorithms applied to construct the models. We also unbalance the datasets to evaluate the model performance in more realistic scenarios in which not many malware samples are available. Although all studied feature configurations provide accuracies greater than 0.98, and several of them achieve a Matthews correlation coefficient greater than 0.95 in the unbalanced datasets, statistically meaningful differences appear, so we analyze the results to determine which is the optimal set of features. We obtain a model that achieves an accuracy of 0.9937 in the balanced dataset and a Matthews correlation coefficient of 0.964 in the unbalanced dataset with 5% of malware. © 2021 Elsevier Ltd. All rights reserved.
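
The abstract reports the Matthews correlation coefficient (MCC) alongside accuracy for the unbalanced datasets. The short sketch below illustrates one plausible reason for that choice; it is not the authors' evaluation code. The 1000-sample test set with 5% malware, the two prediction vectors, and the helper names `confusion_counts`, `accuracy`, and `mcc` are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating label 1 as malware and 0 as benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 by convention if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical unbalanced test set: 50 malware (1) and 950 benign (0) samples.
y_true = [1] * 50 + [0] * 950

# Hypothetical classifier A: labels every sample as benign.
all_benign = [0] * 1000

# Hypothetical classifier B: misses 5 malware samples, raises 5 false alarms.
detector = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 945

for name, y_pred in [("all-benign", all_benign), ("detector", detector)]:
    counts = confusion_counts(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(*counts):.4f}, MCC={mcc(*counts):.4f}")
```

With these made-up predictions, the all-benign classifier still scores an accuracy of 0.95 while its MCC is 0, and the detector's accuracy of 0.99 corresponds to an MCC of roughly 0.89. When malware is rare, accuracy alone can hide missed detections, whereas MCC accounts for all four cells of the confusion matrix.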
