☆ 4.5 Article

Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study

JOURNAL OF SYSTEMS AND SOFTWARE (2017)

Journal

JOURNAL OF SYSTEMS AND SOFTWARE

Volume 132, Pages 226-252

Publisher

ELSEVIER SCIENCE INC

DOI: 10.1016/j.jss.2017.07.012

Keywords

Empirical software engineering estimation; KNN; Imputation; Cross-validation; Missing data

Categories

Computer Science, Software Engineering; Computer Science, Theory & Methods

Funding

General Research Fund of the Research Grants Council of Hong Kong [125113, 11200015, 11208017, 11214116]
Research funds of City University of Hong Kong [7004683, 7004474]
Engineering and Physical Sciences Research Council [EP/J017515/1] Funding Source: researchfish
EPSRC [EP/J017515/1] Funding Source: UKRI

Abstract

Being able to predict software quality is essential, but it also poses significant challenges in software engineering. Historical software project datasets are often used together with various machine learning algorithms for fault-proneness classification. Unfortunately, missing values in these datasets reduce estimation accuracy and can therefore lead to inconsistent results. As a method for handling missing data, K nearest neighbor (KNN) imputation has gradually gained acceptance in empirical studies because of its strong performance and simplicity. To date, researchers still call for optimized parameter settings for KNN imputation to further improve its performance. In this work, we develop a novel incomplete-instance based KNN imputation technique, which uses a cross-validation scheme to optimize the parameters for each missing value. An experimental assessment is conducted on eight quality datasets under various missingness scenarios. The study also compares the proposed imputation approach with mean imputation and three other KNN imputation approaches. The results show that our proposed approach is generally superior to the others. The relatively optimal fixed parameter settings for KNN imputation on software quality data are also determined. It is observed that classification accuracy is improved, or at least maintained, when our approach is used for missing data imputation. (C) 2017 Elsevier Inc. All rights reserved.
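
The abstract describes the technique only at a high level, so the following is a minimal illustrative sketch of the general idea, not the authors' algorithm. It assumes numeric features, missing values encoded as `None`, a distance computed over the features observed in both rows (so incomplete rows can still act as donors), and leave-one-out cross-validation over the donor rows to pick `k` separately for each missing cell. The function names and the candidate list `candidate_ks` are hypothetical.

```python
import math


def distance(a, b, feats):
    """Root mean squared difference over the features in `feats` observed in both rows."""
    shared = [c for c in feats if a[c] is not None and b[c] is not None]
    if not shared:
        return math.inf  # no common evidence: treat the rows as maximally distant
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in shared) / len(shared))


def knn_estimate(target, donors, col, feats, k):
    """Mean of column `col` over the k donors nearest to `target`."""
    ranked = sorted(donors, key=lambda d: distance(target, d, feats))
    nearest = ranked[:k]
    return sum(d[col] for d in nearest) / len(nearest)


def choose_k(donors, col, feats, candidate_ks):
    """Pick the candidate k with the lowest leave-one-out absolute error on the donors."""
    best_k, best_err = candidate_ks[0], math.inf
    for k in candidate_ks:
        if k > len(donors) - 1:
            continue  # not enough remaining donors to evaluate this k
        err = 0.0
        for i, held_out in enumerate(donors):
            rest = donors[:i] + donors[i + 1:]
            err += abs(knn_estimate(held_out, rest, col, feats, k) - held_out[col])
        err /= len(donors)
        if err < best_err:
            best_k, best_err = k, err
    return best_k


def cv_knn_impute(rows, candidate_ks=(1, 3, 5)):
    """Return a copy of `rows` with each None cell filled by KNN, tuning k per cell."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Only features the target row actually has can guide the search.
            feats = [c for c in range(len(row)) if c != col and row[c] is not None]
            # Donors may themselves be incomplete, as long as `col` is observed.
            donors = [r for j, r in enumerate(rows) if j != i and r[col] is not None]
            if not donors:
                continue  # nothing to learn from; leave the cell missing
            k = choose_k(donors, col, feats, candidate_ks)
            imputed[i][col] = knn_estimate(row, donors, col, feats, k)
    return imputed


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.0, None]]
    print(cv_knn_impute(data))
```

Because the cross-validation is restricted to the features that the target row has observed, rows with different missingness patterns can end up with different values of `k`, which is one simple way to tune the parameter per missing value. A real study would also need feature scaling and a tie-breaking rule, both omitted here.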
