4.6 Article Proceedings Paper

Comparing software prediction techniques using simulation

Journal

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
Volume 27, Issue 11, Pages 1014-1022

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/32.965341

Keywords

prediction system; simulation; machine learning; data set characteristic

Ask authors/readers for more resources

The need for accurate software prediction systems increases as software becomes much larger and more complex. A variety of techniques have been proposed; however, none has proven consistently accurate and there is still much uncertainty as to what technique suits which type of prediction problem. We believe that the underlying characteristics-size, number of features, type of distribution, etc-of the data set influence the choice of the prediction system to be used. For this reason, we would like to control the characteristics of such data sets in order to systematically explore the relationship between accuracy, choice of prediction system, and data set characteristic. Also, in previous work, it has proven difficult to obtain significant results over small data sets. Consequently, it would be useful to have a large validation data set. Our solution is to simulate data allowing both control and the possibility of large (1,000) validation cases. In this paper, we compared four prediction techniques: regression, rule induction, nearest neighbor (a form of case-based reasoning), and neural nets. The results suggest that there are significant differences depending upon the characteristics of the data set. Consequently, researchers should consider prediction context when evaluating competing prediction systems. We also observed that the more messy the data and the more complex the relationship with the dependent variable, the more variability in the results. In the more complex cases, we observed significantly different results depending upon the particular training set that has been sampled from the underlying data set. This suggests that researchers will need to exercise caution when comparing different approaches and utilize procedures such as bootstrapping in order to generate multiple samples for training purposes. However, our most important result is that it is more fruitful to ask which is the best prediction system in a particular context rather than which is the best prediction system.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.6
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available