What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning

ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY (2022)

Journal

ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY

Volume 31, Issue 2

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/3485135

Keywords

Semantic clones; embedding; visual representation; representation learning

Funding

Luxembourg National Research Fund (FNR) [14591304, 11693861]
Luxembourg Ministry of Foreign and European Affairs
European Research Council (ERC) under the European Union [949014]

Automated Summary

This research proposes a novel embedding approach called WySiWiM, which leverages the visual patterns in source code and uses pre-trained image classification neural networks for transfer learning. The evaluation on several tasks shows that the WySiWiM approach performs as effectively as state-of-the-art methods.

Abstract

Recent successes in training word embeddings for Natural Language Processing (NLP) tasks have encouraged a wave of research on representation learning for source code, which builds on similar NLP methods. The overall objective is to produce code embeddings that capture as much of the program semantics as possible. State-of-the-art approaches invariably rely on a syntactic representation (i.e., raw lexical tokens, abstract syntax trees, or intermediate representation tokens) to generate embeddings, which are criticized in the literature as non-robust or non-generalizable. In this work, we investigate a novel embedding approach based on the intuition that source code has visual patterns of semantics. We further use these patterns to address the outstanding challenge of identifying semantic code clones. We propose the WySiWiM (What You See Is What It Means) approach, where visual representations of source code are fed into powerful pre-trained image classification neural networks from the field of computer vision to benefit from the practical advantages of transfer learning. We evaluate the proposed embedding approach on the task of vulnerable code prediction in source code and on two variations of the task of semantic code clone identification: code clone detection (a binary classification problem) and code classification (a multi-class classification problem). We show with experiments on BigCloneBench (Java) and Open Judge (C) that, although simple, our WySiWiM approach performs as effectively as state-of-the-art approaches such as ASTNN or TBCNN. We also show with data from NVD and SARD that the WySiWiM representation can be used to learn a vulnerable code detector with reasonable performance (accuracy of approximately 90%). We further explore the influence of different steps in our approach, such as the choice of visual representations or the classification algorithm, and finally discuss the promises and limitations of this research direction.
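The pipeline described in the abstract (render code as an image, pass the image through a pre-trained vision network, use the resulting vector as the code embedding) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: `code_to_image` is a hypothetical one-character-per-pixel rendering, `toy_extractor` is a stand-in for a real pre-trained image classification network, and comparing embeddings with `cosine` replaces the trained classifiers the paper evaluates.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[int]]  # rows of grayscale pixel values in 0..255


def code_to_image(source: str, width: int = 32, height: int = 32) -> Image:
    """Render source text as a fixed-size grayscale grid, one character per pixel.

    Assumption: a crude stand-in for a real visual rendering of the code.
    Whitespace becomes background (0); other characters map to their code point,
    capped at 255. Text beyond the grid is cropped, short text is zero-padded.
    """
    lines = source.expandtabs(4).splitlines()
    image: Image = []
    for row in range(height):
        line = lines[row] if row < len(lines) else ""
        pixels = [0 if ch.isspace() else min(ord(ch), 255) for ch in line[:width]]
        pixels += [0] * (width - len(pixels))
        image.append(pixels)
    return image


def toy_extractor(image: Image) -> List[float]:
    """Hypothetical feature extractor: normalized row means followed by column means.

    In a transfer-learning setup, a pre-trained image network with its final
    classification layer removed would take this place.
    """
    rows = [sum(r) / (255 * len(r)) for r in image]
    cols = [sum(c) / (255 * len(image)) for c in zip(*image)]
    return rows + cols


def embed(image: Image, extractor: Callable[[Image], Sequence[float]]) -> List[float]:
    """Turn an image into an embedding vector using the supplied extractor."""
    return list(extractor(image))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    snippet_a = "int add(int a, int b) {\n    return a + b;\n}"
    snippet_b = "int sum(int x, int y) {\n    int r = x + y;\n    return r;\n}"
    emb_a = embed(code_to_image(snippet_a), toy_extractor)
    emb_b = embed(code_to_image(snippet_b), toy_extractor)
    print(f"similarity: {cosine(emb_a, emb_b):.3f}")
```

Passing the extractor in as a parameter keeps the rendering step independent of the vision model, so the toy function could be swapped for a real pre-trained network without changing the rest of the sketch. The similarity score here only shows how embeddings might be compared; it says nothing about the accuracy figures reported in the abstract.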
