Article

Pkg2Vec: Hierarchical package embedding for code authorship attribution

Publisher

ELSEVIER
DOI: 10.1016/j.future.2020.10.020

Keywords

Source code authorship attribution; Code embedding; Hierarchical neural networks

This paper introduces Pkg2Vec, a novel approach to software package authorship attribution based on a hierarchical deep neural network architecture. The package-level setting better reflects real-world scenarios in which code is organized in packages and written by teams. By combining a hierarchical neural network model with resilient features such as keywords and API calls, Pkg2Vec outperforms other approaches on a large dataset of public packages.
Authorship attribution of software is the task of identifying the author of a given piece of code. Code attribution is important in multiple scenarios, ranging from software plagiarism to cybersecurity. In this paper, we introduce authorship attribution of software packages, which better reflects real-world scenarios in which code is organized in packages and written by teams. We present a novel approach for software package authorship attribution called Pkg2Vec, based on a hierarchical deep neural network (DNN) architecture that corresponds to the hierarchical nature of software (code) packages. The hierarchical neural network model consists of a token-level encoder and an attention mechanism for a function-level encoder, which together produce the package embedding. Beyond the package embedding, we use keywords and API calls as resilient features that reflect the programmer's intention and style. Pkg2Vec is evaluated on a large dataset of public packages and compared with several state-of-the-art source code authorship attribution algorithms. We find that Pkg2Vec significantly outperforms other approaches, achieving a 13% improvement in accuracy. © 2020 Elsevier B.V. All rights reserved.
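
The two-level structure described in the abstract (tokens grouped into functions, functions grouped into a package) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the authors' model: the token-level encoder is replaced by a plain average of random token vectors, the attention step is a dot product against a hypothetical `context` vector followed by a softmax, and the names `encode_function`, `embed_package`, `token_table`, and `context` do not come from the paper. The keyword and API-call features are left out.

```python
import math
import random


def encode_function(token_ids, token_table):
    """Stand-in for a token-level encoder: average the token vectors of one function."""
    dim = len(token_table[0])
    vecs = [token_table[t] for t in token_ids]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed_package(functions, token_table, context):
    """Attention pooling: weight each function vector, then sum into one package vector."""
    func_vecs = [encode_function(f, token_table) for f in functions]
    # Score each function by its dot product with the (hypothetical) context vector.
    scores = [sum(a * b for a, b in zip(fv, context)) for fv in func_vecs]
    weights = softmax(scores)
    dim = len(context)
    package_vec = [
        sum(w * fv[i] for w, fv in zip(weights, func_vecs)) for i in range(dim)
    ]
    return package_vec, weights


# Toy data: random, untrained parameters and a package with three functions.
rng = random.Random(0)
VOCAB, DIM = 50, 8
token_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
context = [rng.uniform(-1, 1) for _ in range(DIM)]
package = [[1, 4, 7, 4], [12, 3], [9, 9, 30, 2, 5]]  # token ids per function

package_vec, weights = embed_package(package, token_table, context)
print(len(package_vec), [round(w, 3) for w in weights])
```

In a trained model the token vectors and the attention parameters would be learned from labeled packages, and the resulting package vector would feed a classifier over candidate authors; here they are random, so the output only demonstrates the shapes and the pooling step.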
