Article

Evaluating software user feedback classifier performance on unseen apps, datasets, and metadata

Journal

EMPIRICAL SOFTWARE ENGINEERING
Volume 28, Issue 2, Pages -

Publisher

SPRINGER
DOI: 10.1007/s10664-022-10254-y

Keywords

Software user feedback; User feedback classification; Unseen data domains; Machine learning; Requirements engineering; Software quality


Understanding user needs is crucial for high-quality software. This study evaluates how well machine learning classifiers identify bug reports and feature requests in software user feedback. The results show that using channel-specific metadata as features does not significantly improve classification performance, and that classifiers do not perform well on unseen datasets.
Understanding users' needs is crucial to building and maintaining high-quality software. Online software user feedback has been shown to contain large amounts of information useful to requirements engineering (RE). Previous studies have created machine learning classifiers for parsing this feedback for development insight. While these classifiers report generally good performance when evaluated on a test set, questions remain as to how well they extend to unseen data in various forms. This study evaluates machine learning classifiers' performance on feedback for two common classification tasks (classifying bug reports and feature requests). Using seven datasets from prior research studies, we investigate the performance of classifiers when evaluated on feedback from different apps than those contained in the training set and when evaluated on completely different datasets (coming from different feedback channels and/or labelled by different researchers). We also measure the difference in performance of using channel-specific metadata as a feature in classification. We find that using metadata as features in classifying bug reports and feature requests does not lead to a statistically significant improvement in the majority of datasets tested. We also demonstrate that classification performance is similar on feedback from unseen apps compared to seen apps in the majority of cases tested. However, the classifiers evaluated do not perform well on unseen datasets. We show that multi-dataset training or zero-shot classification approaches can somewhat mitigate this performance decrease. We discuss the implications of these results on developing user feedback classification models to analyse and extract software requirements.
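As a rough illustration of the "metadata as features" setup discussed in the abstract, the sketch below combines feedback text with a channel-specific metadata field (here, a star rating) in a single classifier. The example data, column names, and TF-IDF plus logistic-regression model are assumptions made purely for illustration; this is not the authors' actual pipeline or datasets.

# Minimal sketch (assumed setup): classify user feedback as bug report vs. not,
# using both the feedback text and a channel-specific metadata feature (rating).
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import pandas as pd

# Hypothetical labelled feedback: text, a metadata field, and a binary label.
data = pd.DataFrame({
    "text": ["App crashes when I open settings",
             "Please add a dark mode option",
             "Login fails after the latest update",
             "Would love offline support"],
    "rating": [1, 4, 2, 5],          # channel-specific metadata (e.g., app store stars)
    "is_bug_report": [1, 0, 1, 0],   # classification target
})

# Text goes through TF-IDF; the metadata column is passed through as-is.
features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),
    ("meta", "passthrough", ["rating"]),
])

clf = Pipeline([
    ("features", features),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(data[["text", "rating"]], data["is_bug_report"])

# Predict on a new (hypothetical) piece of feedback.
print(clf.predict(pd.DataFrame({"text": ["Crashes on startup"], "rating": [1]})))

Dropping the "meta" entry from the ColumnTransformer gives the text-only baseline, which is the kind of comparison (with vs. without metadata features) the study reports on.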

