Article

Estimating Semantic Relatedness in Source Code

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/2824251

Keywords

Design; Experimentation; Theory; Semantic relatedness; information retrieval; clustering; latent semantics; information theory

Funding

  1. Louisiana Board of Regents Research Competitiveness Subprogram (LA BoR-RCS) [LEQSF(2015-18)-RD-A-07]

Abstract

Contemporary software engineering tools exploit semantic relations between individual code terms to aid code analysis and retrieval tasks. Such tools employ word similarity methods, often used in natural language processing (NLP), to analyze the textual content of source code. However, the notion of similarity in source code differs from that in natural language. Source code often includes unnatural, domain-specific terms (e.g., abbreviations and acronyms), and such terms may be related through their structural relations rather than their linguistic properties. Applying natural language similarity methods to source code without adjustment can therefore produce low-quality, error-prone results. Motivated by these observations, we systematically investigate the performance of several semantic-relatedness methods in the context of software. Our main objective is to identify the semantic schemes that most effectively capture association relations between source code terms. To provide an unbiased comparison, the methods are compared against human-generated relatedness information using terms from three software systems. Results show that corpus-based methods tend to outperform methods that exploit external sources of semantic knowledge. However, due to inherent limitations of code, the performance of such methods is still suboptimal. To address these limitations, we propose Normalized Software Distance (NSD), an information-theoretic method that captures semantic relatedness in source code by exploiting the distributional cues of code terms across the system. NSD overcomes the data sparsity and lack-of-context problems often associated with source code, achieving closer resemblance to the human perception of relatedness at both the term and text levels of code.
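
The abstract does not state the NSD formula, so the following is only a rough illustration of what a corpus-based, information-theoretic distance over term co-occurrence can look like. It uses the Normalized Google Distance form as an assumed stand-in, not the paper's definition. The function name `normalized_distance`, the choice of one term set per code unit, and the sample data are all hypothetical.

```python
import math


def normalized_distance(term_a, term_b, documents):
    """NGD-style distance between two terms (illustrative assumption,
    not the NSD definition from the paper).

    `documents` is a list of sets, each holding the terms of one code
    unit (for example, one method or class).
    """
    n = len(documents)
    # Number of code units containing each term, and both terms together.
    freq_a = sum(1 for doc in documents if term_a in doc)
    freq_b = sum(1 for doc in documents if term_b in doc)
    freq_ab = sum(1 for doc in documents if term_a in doc and term_b in doc)

    # Terms that never co-occur are treated as maximally distant.
    if freq_a == 0 or freq_b == 0 or freq_ab == 0:
        return math.inf

    log_a, log_b = math.log(freq_a), math.log(freq_b)
    denominator = math.log(n) - min(log_a, log_b)
    if denominator == 0:
        # Both terms appear in every code unit.
        return 0.0
    return (max(log_a, log_b) - math.log(freq_ab)) / denominator


# Hypothetical term sets extracted from four code units.
units = [
    {"btn", "click", "listener"},
    {"btn", "click", "panel"},
    {"db", "conn", "query"},
    {"db", "query", "result"},
]

same_context = normalized_distance("btn", "click", units)
partial_overlap = normalized_distance("db", "conn", units)
no_overlap = normalized_distance("btn", "db", units)
```

In this toy data, terms that always appear together get a distance of zero, terms that never share a code unit get an infinite distance, and partially overlapping terms fall in between. Lower values mean stronger relatedness.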
