Article

Detecting non-natural language artifacts for de-noising bug reports

AUTOMATED SOFTWARE ENGINEERING (2022)

Journal

AUTOMATED SOFTWARE ENGINEERING

Volume 29, Issue 2, Pages -

Publisher

SPRINGER

DOI: 10.1007/s10515-022-00350-0

Keywords

NLP; Bug reports; Issue tickets; Data cleaning; Artifact removal; De-noising

Categories

Computer Science, Software Engineering

Funding

Austrian Science Fund (FWF) [P 32653-N]

Summary

This study proposes a machine-learning-based approach that classifies textual content into natural language and non-natural language artifacts at the line level. It shows how data from GitHub issue trackers can be used to generate training sets and presents a custom preprocessing approach for artifact removal.

Abstract

Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs, and stack traces. These artifacts not only inflate issue ticket sizes; the noise they introduce can also pose a real problem for some NLP approaches and therefore has to be removed during pre-processing. In this paper, we present a machine-learning-based approach to classify textual content into natural language and non-natural language artifacts at the line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom preprocessing approach for the task of artifact removal. The training sets are automatically created from Markdown-annotated issue tickets and project documentation files. We use these generated training sets to train a Markdown-agnostic model that is able to classify un-annotated content. We evaluate our approach on issue tickets from projects written in C++, Java, JavaScript, PHP, and Python. Our approach achieves ROC-AUC scores between 0.92 and 0.96 for language-specific models. A multi-language model trained on the issue tickets of all languages achieves ROC-AUC scores between 0.92 and 0.95. The provided models are intended to be used as noise reduction pre-processing steps for NLP and IR approaches working on issue tickets.
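The idea of deriving line-level labels from Markdown-annotated tickets can be illustrated with a minimal sketch. It is not the authors' pipeline, whose details the abstract does not give. It assumes that lines inside fenced code blocks count as artifacts, that all other non-empty lines count as natural language, and that the fence markers themselves are dropped so a downstream model never sees them. The function name `label_lines` and the 0/1 label encoding are hypothetical.

```python
from typing import List, Tuple

# Markdown code fence marker, built programmatically (assumption: only
# backtick fences are handled; tilde fences and inline code are ignored).
FENCE = "`" * 3


def label_lines(markdown_text: str) -> List[Tuple[str, int]]:
    """Return (line, label) pairs for a Markdown-annotated ticket.

    Label 1 = artifact (line inside a fenced block), 0 = natural language.
    Fence marker lines and blank lines are dropped from the output.
    """
    labeled: List[Tuple[str, int]] = []
    in_fence = False
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # A fence line toggles the state and is not emitted, so the
            # resulting training data carries no Markdown annotation.
            in_fence = not in_fence
            continue
        if not stripped:
            continue  # skip blank lines
        labeled.append((line, 1 if in_fence else 0))
    return labeled


if __name__ == "__main__":
    ticket = "\n".join([
        "The app crashes on startup.",
        FENCE + "java",
        "java.lang.NullPointerException",
        "    at com.example.Main.main(Main.java:5)",
        FENCE,
        "Any ideas?",
    ])
    for text, label in label_lines(ticket):
        print(label, text)
```

Pairs produced this way could serve as training data for any line-level binary classifier. Which features and which model the paper uses is not stated in the abstract, so that step is left open here.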
