Article

Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

Journal

JOURNAL OF SYSTEMS AND SOFTWARE
Volume 81, Issue 12, Pages 2361-2370

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.jss.2008.05.008

Keywords

Missing data; Missing data toleration; C4.5; Data imputation; Software project cost prediction

Funding

  1. National Natural Science Foundation of China [60673124, 90718024]
  2. Hi-Tech Research & Development Program of China [2006AA01Z183]
  3. New Century Excellent Talents in University [NCET-07-0674]

Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to impute the missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact upon prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%. (C) 2008 Elsevier Inc. All rights reserved.
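
To make the imputation step in the abstract concrete, the following is a minimal sketch of k-NN imputation on purely numeric data. It is not the authors' implementation: the function name `knn_impute`, the use of `None` to mark missing values, the distance measure (Euclidean over the features two rows both have observed, averaged over the number of shared features), and the use of the donors' mean as the filled-in value are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of `rows` with None entries filled by k-NN imputation.

    Assumptions (illustrative only): all features are numeric, missing
    values are marked with None, and the imputed value is the mean of the
    feature over the k nearest rows that have it observed.
    """
    def distance(a, b):
        # Compare only the features that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with feature j observed and at
            # least one feature in common with the incomplete row.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical toy data: the last project is missing its third feature.
projects = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 9.0, 100.0],
    [1.0, 2.0, None],
]
completed = knn_impute(projects, k=2)
```

In the comparison the abstract describes, a data set completed this way would then be passed to C4.5, and its prediction accuracy compared with C4.5 applied directly to the incomplete data. That second step is not shown here.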
