☆ 4.5 Article

Android malware dataset construction methodology to minimize bias-variance tradeoff

ICT EXPRESS (2022)

Journal

ICT EXPRESS

Volume 8, Issue 3, Pages 444-462

Publisher

ELSEVIER

DOI: 10.1016/j.icte.2021.10.001

Keywords

Android; Malware; Dataset; Bias; Variance; Underfitting; Overfitting; Opcode; Birthmark; Similarity digest hash; Dexofuzzy; N-gram; Clustering

Funding

Institute for Information & Communications Technology Promotion (IITP) - Korea government (MSIT) [2019-0-00026]

Automated Summary

Research on the categorization and detection of Android malware has focused on proposing different learned models and machine learning algorithms. This study examines dataset construction and proposes methods to determine bias and variance and to reduce labeling noise. The proposed method goes beyond existing methods by using unified labels and opcode morphology to construct new types of datasets.

Abstract

Research on Android malware categorization and detection has increasingly focused on proposing learned models built on various features of Android apps and on different machine learning algorithms. For such modeling, constructing the dataset properly is no less important than selecting a suitable algorithm. This study examines dataset construction using Dexofuzzy and proposes methods to determine the degree of bias and variance in that process and to minimize noise in sample labeling, where even identical samples can receive different labels. The proposed method goes beyond existing dataset construction methods, which rely on label data provided by antivirus vendors, by constructing new types of datasets from unified labels combined with opcode morphology. From these newly constructed datasets, a flexible dataset that allows overfitting and underfitting to be taken into account was obtained via N-Gram and M-Partial Matching. This flexible dataset was then clustered, and the resulting clustering performance was evaluated. © 2021 The Author(s). Published by Elsevier B.V. on behalf of The Korean Institute of Communications and Information Sciences.
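The role of N-Gram and M-Partial Matching as a flexibility knob can be illustrated with a minimal sketch. This is not the authors' implementation and does not call the Dexofuzzy library. It assumes that each sample is already represented by a digest string, that "N-gram" means a contiguous substring of length N of that digest, and that two samples "M-partially match" when they share at least M distinct N-grams. The digest strings, sample names, and function names are all hypothetical.

```python
from itertools import combinations


def ngrams(digest: str, n: int) -> set[str]:
    """Return the set of contiguous length-n substrings of a digest string."""
    return {digest[i:i + n] for i in range(len(digest) - n + 1)}


def partial_match(a: str, b: str, n: int, m: int) -> bool:
    """True when two digests share at least m distinct n-grams (assumed rule)."""
    return len(ngrams(a, n) & ngrams(b, n)) >= m


def cluster(digests: dict[str, str], n: int, m: int) -> list[set[str]]:
    """Group sample ids into connected components of the partial-match graph."""
    parent = {sid: sid for sid in digests}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(digests, 2):
        if partial_match(digests[a], digests[b], n, m):
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for sid in digests:
        groups.setdefault(find(sid), set()).add(sid)
    return list(groups.values())


# Made-up digest strings standing in for opcode-based similarity digests.
samples = {
    "app_a": "abcdefghij",
    "app_b": "abcdefgxyz",
    "app_c": "qrstuvwxyz",
}

strict = cluster(samples, n=4, m=5)    # demanding match: every sample stays alone
moderate = cluster(samples, n=4, m=4)  # app_a and app_b merge
loose = cluster(samples, n=3, m=1)     # one shared 3-gram is enough: all merge
```

In this toy setting, large N and M split samples into many small clusters, which resembles a dataset tuned toward overfitting, while small N and M merge loosely related samples, which resembles one tuned toward underfitting. Whether the paper parameterizes the tradeoff in exactly this way is an assumption of the sketch.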
