Article

Large Scale Evaluation of Natural Language Processing Based Test-to-Code Traceability Approaches

Journal

IEEE ACCESS
Volume 9, Pages 79089-79104

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2021.3083923

Keywords

Large scale integration; Testing; Task analysis; Production; Syntactics; Semantics; Maintenance engineering; Software testing; unit testing; test-to-code traceability; natural language processing; word embedding; latent semantic indexing

Funding

  1. New National Excellence Programs [UNKP-20-3-SZTE, UNKP-20-5-SZTE]
  2. Ministry of Innovation and Technology, Hungary [NKFIH-1279-2/2020N]
  3. Artificial Intelligence National Laboratory Programme of the National Research, Development and Innovation (NRDI) Office of the Ministry of Innovation and Technology
  4. Janos Bolyai Scholarship of the Hungarian Academy of Sciences

This paper investigates the applicability of text-based methods in software engineering for traceability purposes. It discusses the advantages and disadvantages of text-based methods, as well as the potential of combining different techniques to achieve test-to-code traceability even without following naming conventions.
Traceability information can be crucial for software maintenance, testing, automatic program repair, and various other software engineering tasks. Customarily, a vast amount of test code is created for systems to maintain and improve software quality. Today's test systems may contain tens of thousands of tests. Finding the parts of code tested by each test case is usually a difficult and time-consuming task without the help of the authors of the tests or at least clear naming conventions. Recent test-to-code traceability research has employed various approaches, but textual methods as standalone techniques have been investigated only marginally. The naming convention approach is well regarded among developers. However, besides the fact that naming conventions are often followed only voluntarily, one of the approach's main weaknesses is that it can only identify one-to-one links. With more versatile text-based methods, candidates can be ranked by similarity, producing a number of possible connections. Textual methods also have their disadvantages: even machine learning techniques can only provide semantically connected links from the text itself, although these can be refined by incorporating structural information. In this paper, we investigate the applicability of three text-based methods, both as standalone traceability link recovery techniques and in combination with each other and with naming conventions. The paper presents an extensive evaluation of these techniques using several source code representations and meta-parameter settings on eight real, medium-sized software systems with a combined size of over 1.25 million lines of code. Our results suggest that, with suitable settings, text-based approaches can be used for test-to-code traceability purposes, even where naming conventions were not followed.
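
The contrast the abstract draws between naming conventions (at most one link per test) and similarity ranking (an ordered list of candidates) can be illustrated with a minimal sketch. This is not the paper's implementation: the plain term-frequency cosine similarity stands in for the text-based methods the paper evaluates, and the function names, the "Test" prefix/suffix rule, and the toy Stack/Queue sources are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(source):
    """Split source text into lowercase words, breaking camelCase identifiers."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", source)
    return [w.lower() for w in words]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def naming_convention_link(test_class_name, classes):
    """Assumed convention: strip a leading 'Test' or trailing 'Test'/'Tests'.

    Returns one class name or None, so it can only yield a one-to-one link.
    """
    base = re.sub(r"^Test|Tests?$", "", test_class_name)
    return base if base in classes else None


def rank_candidates(test_source, classes):
    """Rank every production class by textual similarity to the test."""
    test_vec = Counter(tokens(test_source))
    scored = [
        (name, cosine(test_vec, Counter(tokens(src))))
        for name, src in classes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical production classes and a test whose name follows no convention.
classes = {
    "Stack": "class Stack { void push(int item) {} int pop() {} boolean isEmpty() {} }",
    "Queue": "class Queue { void enqueue(int item) {} int dequeue() {} }",
}
test_source = (
    "void checkPushThenPop() { Stack stack = new Stack(); "
    "stack.push(1); assertEquals(1, stack.pop()); }"
)

ranked = rank_candidates(test_source, classes)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy setup, a test class named "CollectionChecks" gets no link from the naming rule, while the similarity ranking still produces an ordered candidate list from the test body alone. A fuller experiment would swap the term-frequency vectors for a trained text model and, as the abstract notes, could refine the ranking with structural information.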
