Article

Predicting technical debt from commit contents: reproduction and extension with automated feature selection

Journal

SOFTWARE QUALITY JOURNAL
Volume 28, Issue 4, Pages 1551-1579

Publisher

SPRINGER
DOI: 10.1007/s11219-020-09520-3

Keywords

Natural language processing; Latent Dirichlet allocation; Logistic regression; Word embeddings; Topic modeling; Data mining

Funding

  1. University of Oulu
  2. Oulu University Hospital
  3. Infotech Oulu
  4. Academy of Finland [298020, 328058]


Self-admitted technical debt (SATD) refers to sub-optimal development solutions that are acknowledged in written code comments or commits. We reproduce and improve on prior work by Yan et al. (2018) on detecting commits that introduce self-admitted technical debt. We use multiple natural language processing (NLP) methods: Bag-of-Words, topic modeling, and word embedding vectors. We study 5 open-source projects. Our NLP approach uses logistic Lasso regression from Glmnet to automatically select the best predictor words. A manually labeled dataset from prior work, which identified self-admitted technical debt from code-level commits, serves as ground truth. Our approach achieves an area under the ROC curve 0.15 higher than the prior work when comparing only commit message features, and 0.03 higher overall when manually selected features are replaced with automatically selected words. In both cases, the improvement was statistically significant (p < 0.0001). Our work has four main contributions: comparing different NLP techniques for SATD detection, improving results over previous work, showing how to generate generalizable predictor words when using multiple repositories, and producing a list of words that correlate with SATD. As a concrete result, we release a list of the predictor words that correlate positively with SATD, as well as the datasets and scripts we used, to enable replication studies and to aid in the creation of future classifiers.
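
To make the word-selection idea concrete, the following is a minimal sketch of L1-penalized (Lasso) logistic regression over binary Bag-of-Words features. It is not the authors' pipeline: the paper uses Glmnet, whereas this sketch swaps in a plain proximal-gradient (ISTA) solver written with the Python standard library only. The commit messages, labels, and all function names and parameter values (`lam`, `step`, `iterations`) are hypothetical and chosen purely for illustration.

```python
import math
import re

# Hypothetical toy data: (commit message, 1 if the commit introduces SATD else 0).
COMMITS = [
    ("temporary hack to work around parser bug", 1),
    ("quick hack until upstream fix lands", 1),
    ("ugly workaround for broken cache, fixme later", 1),
    ("fixme: hack around race condition", 1),
    ("add unit tests for parser", 0),
    ("update documentation for cache module", 0),
    ("release version 2.1", 0),
    ("add parser support for new syntax", 0),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam=0.1, step=0.1, iterations=2000):
    """Minimize mean logistic loss + lam * ||w||_1 (intercept unpenalized)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iterations):
        # Prediction error (probability minus label) for every commit.
        errors = []
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            errors.append(1.0 / (1.0 + math.exp(-z)) - label)
        # Gradient step on the intercept, proximal step on each word weight.
        b -= step * sum(errors) / n
        for j in range(p):
            grad_j = sum(e * row[j] for e, row in zip(errors, X)) / n
            w[j] = soft_threshold(w[j] - step * grad_j, step * lam)
    return w, b


messages = [message for message, _ in COMMITS]
labels = [label for _, label in COMMITS]

# Binary Bag-of-Words matrix: one column per vocabulary word.
token_sets = [set(tokenize(message)) for message in messages]
vocab = sorted(set().union(*token_sets))
X = [[1.0 if word in tokens else 0.0 for word in vocab] for tokens in token_sets]

weights, intercept = fit_lasso_logistic(X, labels)

# Words with a non-zero weight are the ones the Lasso penalty "selected".
selected = {word: wt for word, wt in zip(vocab, weights) if wt != 0.0}
positive_words = sorted(
    (word for word, wt in selected.items() if wt > 0), key=lambda word: -selected[word]
)
print("Words positively associated with SATD:", positive_words)
```

The L1 penalty drives most word weights to exactly zero, so the surviving words act as an automatically selected feature list; words with positive weights are candidates for an SATD predictor word list. On real data the penalty strength would normally be tuned by cross-validation rather than fixed by hand as it is here.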

