Proceedings Paper

Android Malware Detection Through a Pre-trained Model for Code Understanding

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-031-21333-5_105

Keywords

Android; Malware; Pre-trained model; Embedding; CodeT5

This study uses the CodeT5 pre-trained language model to generate context- and semantic-aware embeddings that better represent the behavior of Android applications. It shows how these embeddings can be used to train a recurrent neural network for malware detection, and reports promising results.
Despite the large number of approaches proposed for detecting malicious applications targeting platforms such as Android, malware continuously evolves to avoid detection and reach users. Likewise, malware detection engines are continuously improved to keep pace with the most recent malware. Most of these detection tools employ signatures or machine learning models trained on thousands of features, such as API calls, permissions, or the output of taint analysis, among many others, combined with classification algorithms such as decision trees, ensemble methods, or deep learning. However, relying on these features leads to biased models, because the datasets are limited and the features do not consider the real semantics (goals and intentions) of the malicious sample. In this paper, we conduct an initial study of context- and semantic-aware embeddings generated with the CodeT5 pre-trained language model for a better representation of the behaviour of Android applications. After a sample is decompiled to Java, embeddings can be generated from chunks of its source code, yielding a rich representation of the sample. We show how these embeddings can be used to train a recurrent neural network for malware detection tasks, with promising results.
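The pipeline described in the abstract (decompile to Java, split the source into chunks, embed each chunk with CodeT5, feed the sequence of embeddings to a recurrent network) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the chunk size, the CodeT5 checkpoint, the pooling strategy, or the type of recurrent network, so the 40-line chunks, the `Salesforce/codet5-base` checkpoint, mean pooling, and the GRU below are all assumptions, as are the function names `chunk_source`, `embed_chunks`, and `build_classifier`. Decompilation is assumed to have happened already.

```python
from typing import List


def chunk_source(java_source: str, lines_per_chunk: int = 40) -> List[str]:
    """Split decompiled Java source into consecutive chunks of non-empty lines.

    The chunk size is an assumption; the abstract does not specify one.
    """
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    lines = [ln for ln in java_source.splitlines() if ln.strip()]
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def embed_chunks(chunks: List[str], model_name: str = "Salesforce/codet5-base"):
    """Return one mean-pooled encoder embedding per chunk: (n_chunks, hidden)."""
    # Heavy dependencies are imported lazily so chunking works without them.
    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = T5EncoderModel.from_pretrained(model_name).eval()
    batch = tokenizer(chunks, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (n_chunks, seq_len, hidden)
    # Average over real tokens only; padding positions are masked out.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


def build_classifier(embedding_dim: int = 768, hidden_dim: int = 128):
    """Build a GRU that maps a sequence of chunk embeddings to one logit.

    A GRU is an assumed stand-in for the paper's recurrent neural network.
    """
    import torch.nn as nn

    class GRUClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 1)

        def forward(self, x):  # x: (batch, n_chunks, embedding_dim)
            _, last = self.gru(x)  # last: (num_layers, batch, hidden_dim)
            return self.head(last[-1]).squeeze(-1)  # one logit per application

    return GRUClassifier()
```

For one application, `embed_chunks(chunk_source(source)).unsqueeze(0)` would give a tensor of shape `(1, n_chunks, 768)` that the classifier accepts; training would then minimise a binary cross-entropy loss on the logits against malware/benign labels. Running `embed_chunks` downloads the checkpoint on first use.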
