Software defect prediction using ensemble learning on selected features

INFORMATION AND SOFTWARE TECHNOLOGY (2015)

Journal

INFORMATION AND SOFTWARE TECHNOLOGY

Volume 58, Pages 388-402

Publisher

ELSEVIER

DOI: 10.1016/j.infsof.2014.07.005

Keywords

Defect prediction; Ensemble learning; Software quality; Feature selection; Data imbalance; Feature redundancy/correlation

Funding

King Fahd University of Petroleum and Minerals

Abstract

Context: Software defect data suffer from several issues, including redundancy, correlation, feature irrelevance, and missing samples. It is also hard to ensure a balanced distribution between data pertaining to defective and non-defective software. In most experimental cases, data for the non-defective class dominate the dataset.

Objective: The objectives of this paper are to demonstrate the positive effects of combining feature selection and ensemble learning on the performance of defect classification. Along with efficient feature selection, a new two-variant (with and without feature selection) ensemble learning algorithm is proposed to provide robustness to both data imbalance and feature redundancy.

Method: We carefully combine selected ensemble learning models with efficient feature selection to address these issues and mitigate their effects on defect classification performance.

Results: Forward selection showed that only a few features contribute to a high area under the receiver operating characteristic curve (AUC). On the tested datasets, the greedy forward selection (GFS) method outperformed other feature selection techniques such as Pearson's correlation. This suggests that features are highly unstable. However, ensemble learners such as random forests and the proposed algorithm, average probability ensemble (APE), are less affected by poor features than weighted support vector machines (W-SVMs). Moreover, the APE model combined with greedy forward selection (enhanced APE) achieved AUC values of approximately 1.0 for the NASA datasets PC2, PC4, and MC1.

Conclusion: This paper shows that the features of a software dataset must be carefully selected for accurate classification of defective components. Furthermore, tackling the software data issues mentioned above with the proposed combined learning model resulted in remarkable classification performance, paving the way for successful quality control.

© 2014 Elsevier B.V. All rights reserved.
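The two ideas named in the abstract, averaging the defect probabilities of several base models and greedily adding features while AUC improves, can be illustrated with a minimal sketch. The abstract does not state which base learners APE combines, how GFS scores a candidate subset, or when it stops, so everything below is an assumption made for illustration: the function names (`auc`, `average_probability`, `greedy_forward_selection`), the `evaluate` callback, and the `min_gain` stopping rule are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (defective, non-defective) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_probability(per_model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average, sample by sample, the defect probabilities predicted by
    several base models (assumed here to be equally weighted)."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def greedy_forward_selection(
    n_features: int,
    evaluate: Callable[[List[int]], float],
    min_gain: float = 1e-9,
) -> List[int]:
    """Grow a feature subset one index at a time.

    `evaluate` is a caller-supplied (hypothetical) function that returns a
    score such as cross-validated AUC for a list of feature indices.
    Selection stops when no remaining feature improves the score by more
    than `min_gain` (an assumed stopping rule).
    """
    selected: List[int] = []
    best = float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(evaluate(selected + [f]), f) for f in sorted(remaining)]
        # Highest score wins; ties go to the lowest feature index.
        score, feature = max(candidates, key=lambda c: (c[0], -c[1]))
        if score - best <= min_gain:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected
```

In practice `evaluate` would train the ensemble on the candidate columns and return its validation AUC, and each entry passed to `average_probability` would be one base model's predicted defect probabilities for the same samples. Because AUC depends only on how samples are ranked, it remains informative when non-defective samples dominate the dataset, which is the imbalance the abstract describes.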
