Article

Estimating Semantic Relatedness in Source Code

ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY (2015)

Journal

ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY

Volume 25, Issue 1

Publisher

ASSOC COMPUTING MACHINERY

DOI: 10.1145/2824251

Keywords

Design; Experimentation; Theory; Semantic relatedness; information retrieval; clustering; latent semantics; information theory

Funding

Louisiana Board of Regents Research Competitiveness Subprogram (LA BoR-RCS) [LEQSF(2015-18)-RD-A-07]


Abstract

Contemporary software engineering tools exploit semantic relations between individual code terms to aid in code analysis and retrieval tasks. Such tools employ word similarity methods, often used in natural language processing (NLP), to analyze the textual content of source code. However, the notion of similarity in source code is different from natural language. Source code often includes unnatural domain-specific terms (e.g., abbreviations and acronyms), and such terms might be related due to their structural relations rather than linguistic aspects. Therefore, applying natural language similarity methods to source code without adjustment can produce low-quality and error-prone results. Motivated by these observations, we systematically investigate the performance of several semantic-relatedness methods in the context of software. Our main objective is to identify the most effective semantic schemes in capturing association relations between source code terms. To provide an unbiased comparison, different methods are compared against human-generated relatedness information using terms from three software systems. Results show that corpus-based methods tend to outperform methods that exploit external sources of semantic knowledge. However, due to inherent code limitations, the performance of such methods is still suboptimal. To address these limitations, we propose Normalized Software Distance (NSD), an information-theoretic method that captures semantic relatedness in source code by exploiting the distributional cues of code terms across the system. NSD overcomes data sparsity and lack of context problems often associated with source code, achieving higher levels of resemblance to the human perception of relatedness at the term and the text levels of code.
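The abstract does not state the NSD formula, so the following is only an illustrative sketch of the general idea it describes: scoring two code terms by how their occurrences are distributed across a system. It assumes the well-known Normalized Google Distance formula applied to per-document term sets; treating each method or file as a "document", and the names `normalized_distance` and `documents`, are assumptions made for this example rather than the authors' definition.

```python
import math

def normalized_distance(term_x, term_y, documents):
    """Normalized-Google-Distance-style score between two terms.

    `documents` is a list of sets of terms (a hypothetical stand-in for
    the methods or files of a system). Lower scores mean the two terms
    tend to occur in the same documents. This is NOT the paper's NSD
    definition, only an assumed analogue for illustration.
    """
    n = len(documents)
    fx = sum(1 for d in documents if term_x in d)
    fy = sum(1 for d in documents if term_y in d)
    fxy = sum(1 for d in documents if term_x in d and term_y in d)
    # Terms that never co-occur are treated as maximally distant.
    if fx == 0 or fy == 0 or fxy == 0:
        return math.inf
    log_fx, log_fy = math.log(fx), math.log(fy)
    denominator = math.log(n) - min(log_fx, log_fy)
    # Both terms occur in every document: the distance is zero.
    if denominator == 0:
        return 0.0
    return (max(log_fx, log_fy) - math.log(fxy)) / denominator

# Hypothetical toy corpus: one set of identifiers per method.
documents = [
    {"open", "file", "buf"},
    {"read", "file", "buf"},
    {"close", "file"},
    {"draw", "canvas"},
]

related = normalized_distance("file", "buf", documents)
unrelated = normalized_distance("file", "canvas", documents)
```

In this toy corpus, `file` and `buf` share two of four documents and receive a finite score, while `file` and `canvas` never co-occur and are assigned an infinite distance. Note that abbreviations such as `buf` are handled like any other term, since the score depends only on where terms occur and not on external linguistic knowledge.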
