Article

A cost analysis of machine learning using dynamic runtime opcodes for malware detection

Journal

Computers & Security
Volume 85, Pages 138-155

Publisher

Elsevier Advanced Technology
DOI: 10.1016/j.cose.2019.04.018

Keywords

Malicious code; Network security; Machine learning; Computer security; Malware

Funding

  1. EPSRC [CSIT 2 EP/N508664/1]
  2. EPSRC [EP/K003445/1, EP/R007187/1, EP/K004379/1, EP/N508664/1] Funding Source: UKRI


The ongoing battle between malware distributors and those seeking to prevent the onslaught of malicious code has, so far, favored the former. Anti-virus methods are faltering under the rapid evolution and distribution of new malware, with obfuscation and detection-evasion techniques exacerbating the issue. Recent research has monitored low-level opcodes to detect malware. Such dynamic analysis reveals the code at runtime, allowing the true behaviour to be examined. While previous research uses machine learning techniques to accurately detect malware from dynamic runtime opcodes, the underpinning datasets have been poorly sampled and inadequate in size. Further, the datasets are always fixed in size, and no attempt, to our knowledge, has been made to examine the cost of retraining malware classification models on datasets that grow continually. In the literature, researchers discuss the explosion of malware, yet opcode analyses have used fixed-size datasets, with no consideration of how a model will cope with retraining on escalating datasets. The research presented here examines this problem and makes several novel contributions to the current body of knowledge. First, the performance of 23 machine learning algorithms is investigated on the largest run-trace dataset in the literature. Second, following an extensive hyperparameter selection process, the performance of each classifier is compared on both accuracy and computational cost (CPU time). Lastly, the cost of retraining and testing updatable and non-updatable classifiers, both parallelized and non-parallelized, is examined with simulated escalating datasets. This provides insight into how deployed malware classifiers would perform under simulated dataset escalation. We find that parallelized RandomForest, using 4 cores, provides the optimal performance, with high accuracy and low training and testing times. (C) 2019 Elsevier Ltd. All rights reserved.
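The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the opcode set, the synthetic trace generator, and the bias separating "malicious" from "benign" traces are all hypothetical stand-ins for real dynamic run traces, and scikit-learn's `RandomForestClassifier` (with `n_jobs` controlling core count) is assumed as the parallelized RandomForest.

```python
import time
import random
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

random.seed(0)
OPCODES = ["mov", "push", "pop", "call", "jmp", "xor", "add", "cmp"]

def synthetic_trace(malicious, length=200):
    # Hypothetical generator standing in for a dynamic run trace:
    # "malicious" traces are biased toward xor/jmp opcodes.
    weights = [1, 1, 1, 1, 3, 3, 1, 1] if malicious else [3, 3, 2, 2, 1, 1, 2, 2]
    return random.choices(OPCODES, weights=weights, k=length)

def bigram_features(trace):
    # Represent a run trace as opcode-bigram frequency counts.
    return {f"{a}_{b}": c for (a, b), c in Counter(zip(trace, trace[1:])).items()}

labels = [0] * 300 + [1] * 300
traces = [synthetic_trace(m) for m in labels]
X = DictVectorizer(sparse=False).fit_transform(bigram_features(t) for t in traces)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

# Compare single-core vs. 4-core training cost, echoing the paper's
# accuracy-versus-CPU-time comparison.
for n_jobs in (1, 4):
    clf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    t0 = time.process_time()
    clf.fit(X_tr, y_tr)
    train_t = time.process_time() - t0
    acc = clf.score(X_te, y_te)
    print(f"n_jobs={n_jobs}: train CPU time {train_t:.2f}s, accuracy {acc:.2f}")
```

On real data, the interesting measurement is how `train_t` grows as the dataset is repeatedly enlarged and the model retrained, which is the escalation cost the paper simulates.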


